2025-05-27-12-16
Implementing Agents in JavaScript
Abstract
arXiv:2505.18228v1 Announce Type: new Abstract: This chapter gives an introduction to agent-oriented programming in JavaScript. It provides an example-based walk-through of how to implement abstractions for reasoning loop agents in vanilla JavaScript. The initial example is used as a stepping stone for explaining how to implement slightly more advanced agents and multi-agent systems using JS-son, a JavaScript library for agent-oriented programming. In this context, the chapter also explains how to integrate reasoning loop agents with generative AI technologies--specifically, large language models. Finally, application scenarios in several technology ecosystems and future research directions are sketched.
摘要
本章介绍了JavaScript中的面向智能体编程方法。通过示例演示了如何在原生JavaScript中实现推理循环智能体的抽象概念。初始示例作为基础,进一步阐释了如何运用JS-son(一个面向智能体编程的JavaScript库)来实现更高级的智能体与多智能体系统。在此背景下,本章还探讨了如何将推理循环智能体与生成式人工智能技术——特别是大语言模型——进行整合。最后,概述了该技术在多个技术生态系统中的应用场景及未来研究方向。
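To make the "reasoning loop" idea in the entry above more concrete, here is a minimal sketch in Python rather than JavaScript. It is not JS-son's API and not the chapter's code: the `Agent` class, its fields, and the thermostat example are all hypothetical names chosen for illustration. The loop revises beliefs from a percept, derives the desires that currently hold, and runs the matching plans.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Beliefs = Dict[str, object]


@dataclass
class Agent:
    """Toy reasoning-loop agent (hypothetical, not the JS-son API)."""
    beliefs: Beliefs
    # Each desire maps a name to a predicate over the current beliefs.
    desires: Dict[str, Callable[[Beliefs], bool]]
    # Each plan maps a desire name to an action that returns a description.
    plans: Dict[str, Callable[[Beliefs], str]]
    log: List[str] = field(default_factory=list)

    def step(self, percept: Beliefs) -> List[str]:
        # 1. Belief revision: new percepts overwrite old beliefs.
        self.beliefs = {**self.beliefs, **percept}
        # 2. Deliberation: keep the desires whose condition holds now.
        active = [name for name, holds in self.desires.items() if holds(self.beliefs)]
        # 3. Means-end reasoning: run the plan attached to each active desire.
        actions = [self.plans[name](self.beliefs) for name in active if name in self.plans]
        self.log.extend(actions)
        return actions


thermostat = Agent(
    beliefs={"temperature": 21},
    desires={
        "heat": lambda b: b["temperature"] < 19,
        "cool": lambda b: b["temperature"] > 24,
    },
    plans={
        "heat": lambda b: "turn heater on",
        "cool": lambda b: "turn cooler on",
    },
)
```

Calling `thermostat.step({"temperature": 17})` triggers the heating plan, while a percept inside the comfortable range triggers nothing. A multi-agent system would run several such agents against a shared environment that produces the percepts.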
An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems
Abstract
arXiv:2505.18397v1 Announce Type: new Abstract: Multi-agent AI systems (MAS) offer a promising framework for distributed intelligence, enabling collaborative reasoning, planning, and decision-making across autonomous agents. This paper provides a systematic outlook on the current opportunities and challenges of MAS, drawing insights from recent advances in large language models (LLMs), federated optimization, and human-AI interaction. We formalize key concepts including agent topology, coordination protocols, and shared objectives, and identify major risks such as dependency, misalignment, and vulnerabilities arising from training data overlap. Through a biologically inspired simulation and comprehensive theoretical framing, we highlight critical pathways for developing robust, scalable, and secure MAS in real-world settings.
摘要
多智能体人工智能系统(MAS)为分布式智能提供了一个前景广阔的框架,能够实现自主智能体间的协同推理、规划与决策。本文基于大语言模型(LLMs)、联邦优化和人机交互领域的最新进展,系统性地阐述了当前MAS面临的机遇与挑战。我们形式化定义了智能体拓扑结构、协调协议和共享目标等关键概念,并识别出训练数据重叠导致的依赖性、目标偏差和系统脆弱性等主要风险。通过仿生学模拟实验和综合理论框架,我们重点探讨了在现实场景中开发鲁棒、可扩展且安全的MAS的关键路径。
Pedagogy-R1: Pedagogically-Aligned Reasoning Model with Balanced Educational Benchmark
Abstract
arXiv:2505.18467v1 Announce Type: new Abstract: Recent advances in large reasoning models (LRMs) show strong performance in structured domains such as mathematics and programming; however, they often lack pedagogical coherence and realistic teaching behaviors. To bridge this gap, we introduce Pedagogy-R1, a framework that adapts LRMs for classroom use through three innovations: (1) a distillation-based pipeline that filters and refines model outputs for instruction-tuning, (2) the Well-balanced Educational Benchmark (WBEB), which evaluates performance across subject knowledge, pedagogical knowledge, tracing, essay scoring, and teacher decision-making, and (3) a Chain-of-Pedagogy (CoP) prompting strategy for generating and eliciting teacher-style reasoning. Our mixed-method evaluation combines quantitative metrics with qualitative analysis, providing the first systematic assessment of LRMs' pedagogical strengths and limitations.
摘要
大规模推理模型(LRMs)近期在数学和编程等结构化领域展现出卓越性能,但其往往缺乏教学连贯性与真实教学行为。为弥合这一差距,我们提出Pedagogy-R1框架,通过三项创新实现LRMs的课堂适配:(1)基于蒸馏的流程,对模型输出进行教学调优的筛选与精炼;(2)均衡教育基准(WBEB),从学科知识、教学法知识、学习轨迹追踪、论文评分及教师决策五个维度评估性能;(3)教学链(CoP)提示策略,用于生成和引导教师风格推理。我们采用混合方法评估,结合量化指标与质性分析,首次系统评估了LRMs的教学优势与局限。
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Abstract
arXiv:2505.18325v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they often refuse to answer legitimate queries, a phenomenon known as overrefusal. Overrefusal typically stems from over-conservative safety alignment, causing models to treat many reasonable prompts as potentially risky. To systematically understand this issue, we probe and leverage the models' safety decision boundaries to analyze and mitigate overrefusal. Our findings reveal that overrefusal is closely tied to misalignment at these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, we present RASS, an automated framework for prompt generation and selection that strategically targets overrefusal prompts near the safety boundary. By harnessing steering vectors in the representation space, RASS efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of overrefusal. This approach not only provides a more precise and interpretable view of model safety decisions but also seamlessly extends to multilingual scenarios. We have explored the safety decision boundaries of various LLMs and constructed the MORBench evaluation set to facilitate robust assessment of model safety and helpfulness across multiple languages. Code and datasets will be released at https://anonymous.4open.science/r/RASS-80D3.
摘要
大语言模型(LLMs)在广泛的任务中展现出卓越能力,却经常拒绝回答合理查询——这种现象称为过度拒绝。过度拒绝通常源于过度保守的安全对齐机制,导致模型将许多合理提示视为潜在风险。为系统研究该问题,我们通过探测并利用模型的安全决策边界来分析及缓解过度拒绝。研究发现,过度拒绝与边界区域的错位密切相关,这些区域中模型难以区分良性内容与有害内容的细微差异。基于此,我们提出RASS框架——一种针对安全边界附近过度拒绝提示的自动化生成与选择策略。通过利用表征空间的导向向量,RASS高效识别并筛选边界对齐提示,实现更精准定向的过度拒绝缓解。该方法不仅为模型安全决策提供了更精确可解释的视角,还能无缝扩展至多语言场景。我们探索了多种LLMs的安全决策边界,并构建MORBench评估集以促进跨语言模型安全性与实用性的稳健评估。代码与数据集将在https://anonymous.4open.science/r/RASS-80D3发布。
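The RASS entry above mentions steering vectors in representation space and prompts near the safety boundary. One common way to build a steering direction is the difference between the mean representations of two classes; the sketch below uses that construction on made-up 2-D vectors. Everything here is an assumption for illustration: the function names, the midpoint-based boundary, the margin, and the toy vectors are not taken from the paper.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(harmful: List[Vector], benign: List[Vector]) -> List[float]:
    """Difference of class means, pointing from benign toward harmful."""
    return [h - b for h, b in zip(mean(harmful), mean(benign))]


def projection(x: Vector, direction: Vector, midpoint: Vector) -> float:
    """Signed distance of x from the midpoint, measured along the direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum((xi - mi) * di for xi, mi, di in zip(x, midpoint, direction)) / norm


def near_boundary(prompts: Dict[str, Vector], direction: Vector,
                  midpoint: Vector, margin: float) -> List[str]:
    """Select prompts whose projection lies within the margin of the boundary."""
    return [name for name, vec in prompts.items()
            if abs(projection(vec, direction, midpoint)) <= margin]


# Made-up representations; a real system would read hidden states from a model.
harmful = [[2.0, 0.0], [4.0, 0.0]]
benign = [[-2.0, 0.0], [-4.0, 0.0]]
direction = steering_vector(harmful, benign)
midpoint = mean([mean(harmful), mean(benign)])
prompts = {
    "how do I kill a process": [0.5, 1.0],
    "how do I bake bread": [-3.0, 0.2],
    "how do I build a weapon": [3.5, -0.1],
}
selected = near_boundary(prompts, direction, midpoint, margin=1.0)
```

In this toy setup only the ambiguous "kill a process" prompt falls inside the margin, which is the kind of benign-but-risky-looking prompt an overrefusal benchmark would want to collect.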
Persona Alchemy: Designing, Evaluating, and Implementing Psychologically-Grounded LLM Agents for Diverse Stakeholder Representation
Abstract
arXiv:2505.18351v1 Announce Type: new Abstract: Despite advances in designing personas for Large Language Models (LLMs), challenges remain in aligning them with human cognitive processes and representing diverse stakeholder perspectives. We introduce a Social Cognitive Theory (SCT) agent design framework for designing, evaluating, and implementing psychologically grounded LLMs with consistent behavior. Our framework operationalizes SCT through four personal factors (cognitive, motivational, biological, and affective) for designing, six quantifiable constructs for evaluating, and a graph database-backed architecture for implementing stakeholder personas. Experiments tested agents' responses to contradicting information of varying reliability. In the highly polarized renewable energy transition discourse, we design five diverse agents with distinct ideologies, roles, and stakes to examine stakeholder representation. The evaluation of these agents in contradictory scenarios occurs through comprehensive processes that implement the SCT. Results show consistent response patterns (R² range: 0.58-0.61) and systematic temporal development of SCT construct effects. Principal component analysis identifies two dimensions explaining 73% of variance, validating the theoretical structure. Our framework offers improved explainability and reproducibility compared to black-box approaches. This work contributes to ongoing efforts to improve diverse stakeholder representation while maintaining psychological consistency in LLM personas.
摘要
尽管在设计大型语言模型(LLM)角色方面取得了进展,但在使其与人类认知过程保持一致及呈现多元利益相关者视角方面仍存在挑战。我们提出一种基于社会认知理论(SCT)的智能体设计框架,用于设计、评估和实现具有行为一致性的心理学基础LLM。该框架通过四大个人因素(认知、动机、生物和情感)进行设计,六个可量化构念进行评估,并采用图数据库支撑的架构来实现利益相关者角色建模。实验测试了智能体对不同可靠性矛盾信息的响应。在高度两极化的可再生能源转型讨论中,我们设计了五个具有不同意识形态、角色和利益诉求的多样化智能体,以检验利益相关者表征效果。通过实施SCT的综合流程,对这些智能体在矛盾情境中的表现进行评估。结果显示出一致的响应模式(R²范围:0.58-0.61)以及SCT构念效应的系统性时序发展。主成分分析识别出两个解释73%方差的维度,验证了理论结构。相较于黑箱方法,本框架提供了更好的可解释性和可复现性。这项工作为在保持LLM角色心理一致性的同时提升多元利益相关者表征能力的研究做出了贡献。
Single-agent or Multi-agent Systems? Why Not Both?
Abstract
arXiv:2505.18286v1 Announce Type: new Abstract: Multi-agent systems (MAS) decompose complex tasks and delegate subtasks to different large language model (LLM) agents and tools. Prior studies have reported the superior accuracy performance of MAS across diverse domains, enabled by long-horizon context tracking and error correction through role-specific agents. However, the design and deployment of MAS incur higher complexity and runtime cost compared to single-agent systems (SAS). Meanwhile, frontier LLMs, such as OpenAI-o3 and Gemini-2.5-Pro, have rapidly advanced in long-context reasoning, memory retention, and tool usage, mitigating many limitations that originally motivated MAS designs. In this paper, we conduct an extensive empirical study comparing MAS and SAS across various popular agentic applications. We find that the benefits of MAS over SAS diminish as LLM capabilities improve, and we propose efficient mechanisms to pinpoint the error-prone agent in MAS. Furthermore, the performance discrepancy between MAS and SAS motivates our design of a hybrid agentic paradigm, request cascading between MAS and SAS, to improve both efficiency and capability. Our design improves accuracy by 1.1-12% while reducing deployment costs by up to 20% across various agentic applications.
摘要
多智能体系统(MAS)通过将复杂任务分解并分配给不同的大语言模型(LLM)智能体与工具来实现任务处理。先前研究表明,得益于角色专属智能体的长程上下文追踪与错误修正能力,MAS在多个领域展现出卓越的准确性。然而相较于单智能体系统(SAS),MAS的设计与部署具有更高的复杂性和运行时成本。与此同时,前沿LLM(如OpenAI-o3和Gemini-2.5-Pro)在长上下文推理、记忆保持和工具使用方面快速进步,消解了许多最初促使MAS设计的局限性。本文通过大量实证研究对比了MAS与SAS在各类主流智能体应用中的表现,发现随着LLM能力的提升,MAS相对于SAS的优势逐渐减弱,并提出高效机制以定位MAS中易出错的智能体。此外,MAS与SAS的性能差异促使我们设计出一种混合智能体范式——在MAS与SAS之间进行请求级联,以同步提升效率与能力。该设计在各类智能体应用中实现1.1-12%的准确率提升,同时降低最高达20%的部署成本。
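The hybrid paradigm in the entry above, request cascading between MAS and SAS, can be pictured as a router that tries one system and escalates to the other. The abstract does not say what triggers escalation, so the sketch below assumes a self-reported confidence score and a fixed threshold; `cascade`, `toy_sas`, and `toy_mas` are hypothetical stand-ins, not the paper's components.

```python
from typing import Callable, Tuple

Answer = Tuple[str, float]  # (answer text, confidence in [0, 1]); an assumed interface


def cascade(request: str,
            single_agent: Callable[[str], Answer],
            multi_agent: Callable[[str], str],
            threshold: float = 0.7) -> Tuple[str, str]:
    """Try the cheaper single-agent path first; escalate when confidence is low."""
    answer, confidence = single_agent(request)
    if confidence >= threshold:
        return answer, "SAS"
    # Escalation pays the higher multi-agent cost only for hard requests.
    return multi_agent(request), "MAS"


# Stand-in systems for illustration only.
def toy_sas(request: str) -> Answer:
    return ("short answer", 0.9 if len(request) < 20 else 0.3)


def toy_mas(request: str) -> str:
    return "answer after planner, worker, and reviewer rounds"
```

With these stand-ins, `cascade("2 + 2?", toy_sas, toy_mas)` stays on the single-agent path, while a long request is escalated. The deployment-cost saving in such a design comes from how often the cheap path is good enough.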
RedactOR: An LLM-Powered Framework for Automatic Clinical Data De-Identification
Abstract
arXiv:2505.18380v1 Announce Type: new Abstract: Ensuring clinical data privacy while preserving utility is critical for AI-driven healthcare and data analytics. Existing de-identification (De-ID) methods, including rule-based techniques, deep learning models, and large language models (LLMs), often suffer from recall errors, limited generalization, and inefficiencies, limiting their real-world applicability. We propose a fully automated, multi-modal framework, RedactOR, for de-identifying structured and unstructured electronic health records, including clinical audio records. Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule- and LLM-based approaches, and a two-step audio redaction approach. We present a retrieval-based entity relexicalization approach to ensure consistent substitutions of protected entities, thereby enhancing data coherence for downstream applications. We discuss key design desiderata, de-identification and relexicalization methodology, and modular architecture of RedactX and its integration with the Oracle Health Clinical AI system. Evaluated on the i2b2 2014 De-ID dataset using standard metrics with strict recall, our approach achieves competitive performance while optimizing token usage to reduce LLM costs. Finally, we discuss key lessons and insights from deployment in real-world AI-driven healthcare data pipelines.
摘要
确保临床数据隐私同时保持其实用性,对于AI驱动的医疗保健和数据分析至关重要。现有的去标识化(De-ID)方法,包括基于规则的技术、深度学习模型和大语言模型(LLMs),常存在召回错误、泛化能力有限和效率低下等问题,限制了其实际应用。我们提出了一种全自动多模态框架RedactOR,用于对结构化和非结构化电子健康记录(包括临床音频记录)进行去标识化处理。该框架采用高性价比的去标识化策略,包括智能路由、基于混合规则与LLM的方法,以及两步式音频脱敏方法。我们提出了一种基于检索的实体重词汇化方法,以确保对受保护实体进行一致替换,从而增强下游应用的数据连贯性。本文详细阐述了关键设计需求、去标识化与重词汇化方法、RedactX的模块化架构及其与Oracle Health临床AI系统的集成。在i2b2 2014 De-ID数据集上采用严格召回标准进行评估,我们的方法在优化令牌使用以降低LLM成本的同时,取得了具有竞争力的性能。最后,我们讨论了在实际AI驱动的医疗数据管道部署过程中获得的重要经验与见解。
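The RedactOR entry above combines rule-based redaction with consistent substitution of protected entities. The following sketch shows one simple way those two ideas fit together: a regex rule for phone numbers plus a lookup table that always maps the same name to the same surrogate. It is an assumption-laden toy, not the paper's pipeline: the `Relexicalizer` class, the surrogate pool, and the phone pattern are invented here, and the `names` argument stands in for entities that an LLM or NER model would detect.

```python
import re
from typing import Dict, Iterable

SURROGATE_NAMES = ["Alex Morgan", "Jamie Lee", "Sam Patel"]  # made-up surrogate pool


class Relexicalizer:
    """Replace protected entities with surrogates, consistently across documents."""

    def __init__(self) -> None:
        self.mapping: Dict[str, str] = {}

    def surrogate(self, entity: str) -> str:
        # Look up the entity first so repeated mentions get the same surrogate.
        key = entity.lower()
        if key not in self.mapping:
            index = len(self.mapping) % len(SURROGATE_NAMES)
            self.mapping[key] = SURROGATE_NAMES[index]
        return self.mapping[key]

    def redact(self, text: str, names: Iterable[str]) -> str:
        # Rule-based step: mask phone numbers shaped like 555-123-4567.
        text = re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]", text)
        # Relexicalization step: swap each detected name for its surrogate.
        for name in names:
            replacement = self.surrogate(name)
            text = re.sub(re.escape(name), replacement, text, flags=re.IGNORECASE)
        return text


relex = Relexicalizer()
note_1 = relex.redact("John Smith called 555-123-4567.", ["John Smith"])
note_2 = relex.redact("Follow-up with john smith.", ["John Smith"])
```

Because both notes map the patient to the same surrogate, downstream tools can still link the two records, which is the coherence benefit the abstract attributes to relexicalization.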
A Survey of LLM × DATA
Abstract
arXiv:2505.18458v1 Announce Type: new Abstract: The integration of large language models (LLMs) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, feeds LLMs with the high-quality, diverse, and timely data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows: (i) data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) data storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV-cache management, and fault-tolerant checkpointing; (iii) data serving for LLMs tackles challenges in RAG (e.g., knowledge post-processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general-purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, and discovery; (ii) data analysis, covering reasoning over structured, semi-structured, and unstructured data; and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval-augmented prompting, task-specialized fine-tuning, and multi-agent collaboration.
摘要
大语言模型(LLM)与数据管理(DATA)的融合正在迅速重塑这两个领域。本综述全面审视了双向关系。一方面,DATA4LLM涵盖大规模数据处理、存储与服务,为LLM的预训练、后训练、检索增强生成和智能体工作流等阶段提供高质量、多样化和时效性的数据支持:(i)面向LLM的数据处理包括可扩展采集、去重、过滤、选择、领域混合和合成增强;(ii)LLM数据存储聚焦高效数据与模型格式、分布式异构存储层次、KV缓存管理和容错检查点;(iii)LLM数据服务应对RAG(如知识后处理)、LLM推理(如提示压缩、数据溯源)和训练策略(如数据打包与混洗)等挑战。另一方面,在LLM4DATA中,LLM正成为数据管理的通用引擎。我们梳理了最新进展:(i)数据操作,包括自动化数据清洗、集成与发现;(ii)数据分析,涵盖结构化、半结构化和非结构化数据的推理;(iii)系统优化(如配置调优、查询重写、异常诊断),这些进步得益于检索增强提示、任务专用微调与多智能体协作等LLM技术。
LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
Abstract
arXiv:2505.18517v1 Announce Type: new Abstract: Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.
摘要
基于大型语言模型(LLM)的基础模型在处理多种任务和模态方面已展现出卓越成效。然而,由于声学环境的差异和任务多样性,将这些模型适配于通用音频-语言任务仍具挑战性。本研究提出LiSTEN(面向神经音频LLM的可学习软令牌嵌入框架),通过动态提示选择策略与可学习的键值对,使LLM能够适应语音和音频任务。该框架既能平衡通用知识与任务特定知识,又可避免多任务场景下的过拟合问题。我们的方法降低了对大规模自动语音识别或字幕数据集的依赖,以更少的可训练参数实现竞争性性能,并采用单阶段训练流程简化训练过程。此外,LiSTEN通过分析不同任务间所选提示的多样性与重叠性,增强了模型的可解释性。
MRGAgents: A Multi-Agent Framework for Improved Medical Report Generation with Med-LVLMs
Abstract
arXiv:2505.18530v1 Announce Type: new Abstract: Medical Large Vision-Language Models (Med-LVLMs) have been widely adopted for medical report generation. Although Med-LVLMs achieve state-of-the-art performance, they exhibit a bias toward predicting all findings as normal, leading to reports that overlook critical abnormalities. Furthermore, these models often fail to provide comprehensive descriptions of radiologically relevant regions necessary for accurate diagnosis. To address these challenges, we propose Medical Report Generation Agents (MRGAgents), a novel multi-agent framework that fine-tunes specialized agents for different disease categories. By curating subsets of the IU X-ray and MIMIC-CXR datasets to train disease-specific agents, MRGAgents generates reports that more effectively balance normal and abnormal findings while ensuring a comprehensive description of clinically relevant regions. Our experiments demonstrate that MRGAgents outperforms the state of the art, improving both report comprehensiveness and diagnostic utility.
摘要
医学大型视觉语言模型(Med-LVLMs)已被广泛应用于医学报告生成。尽管Med-LVLMs表现出最先进的性能,但它们存在将所有检查结果预测为正常的倾向,导致生成的报告忽略关键异常。此外,这些模型往往未能提供准确诊断所需的放射学相关区域的全面描述。为解决这些问题,我们提出医学报告生成代理(MRGAgents),这是一种新颖的多代理框架,通过微调针对不同疾病类别的专用代理。通过筛选IU X-ray和MIMIC-CXR数据集的子集以训练疾病特异性代理,MRGAgents生成的报告能更有效地平衡正常与异常结果,同时确保对临床相关区域的全面描述。实验表明,MRGAgents在报告全面性和诊断实用性方面均优于现有最先进方法。
Retrieval Augmented Decision-Making: A Requirements-Driven, Multi-Criteria Framework for Structured Decision Support
Abstract
arXiv:2505.18483v1 Announce Type: new Abstract: Various industries have produced a large number of documents such as industrial plans, technical guidelines, and regulations that are structurally complex and content-wise fragmented. This poses significant challenges for experts and decision-makers in terms of retrieval and understanding. Although existing LLM-based Retrieval-Augmented Generation methods can provide context-related suggestions, they lack quantitative weighting and traceable reasoning paths, making it difficult to offer multi-level and transparent decision support. To address this issue, this paper proposes the RAD method, which integrates Multi-Criteria Decision Making with the semantic understanding capabilities of LLMs. The method automatically extracts key criteria from industry documents, builds a weighted hierarchical decision model, and generates structured reports under model guidance. The RAD framework introduces explicit weight assignment and reasoning chains in decision generation to ensure accuracy, completeness, and traceability. Experiments show that in various decision-making tasks, the decision reports generated by RAD significantly outperform existing methods in terms of detail, rationality, and structure, demonstrating its application value and potential in complex decision support scenarios.
摘要
各行业产生了大量结构复杂、内容零散的工业规划、技术指南和法规文件,这给专家和决策者的检索与理解带来重大挑战。尽管现有基于大语言模型的检索增强生成方法能提供上下文相关建议,但缺乏定量权重和可追溯的推理路径,难以提供多层次、透明的决策支持。针对该问题,本文提出融合多准则决策与大语言模型语义理解能力的RAD方法,该方法能自动从行业文档中提取关键准则,构建加权层次化决策模型,并在模型指导下生成结构化报告。RAD框架在决策生成中引入显式权重分配和推理链,确保准确性、完整性和可追溯性。实验表明,在各类决策任务中,RAD生成的决策报告在细节性、合理性和结构性方面显著优于现有方法,展现了其在复杂决策支持场景中的应用价值与潜力。
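The RAD entry above describes a weighted hierarchical decision model with explicit weights and a traceable reasoning path. A weighted-sum aggregation over a criteria tree is one standard multi-criteria technique that has both properties, and the sketch below uses it as a stand-in; the abstract does not state which aggregation RAD uses. The tree layout, the `evaluate` function, and all weights and scores are hypothetical.

```python
from typing import List, Tuple

Trace = List[Tuple[int, str, float]]  # (depth, criterion name, aggregated score)


def evaluate(name: str, node, trace: Trace, depth: int = 0) -> float:
    """Aggregate a criteria tree by weighted sum and record every step.

    A leaf is a score in [0, 1]; an inner node is a dict that maps each
    child criterion to a (weight, child node) pair.
    """
    if isinstance(node, (int, float)):
        value = float(node)
    else:
        total_weight = sum(weight for weight, _ in node.values())
        value = sum(weight * evaluate(child_name, child, trace, depth + 1)
                    for child_name, (weight, child) in node.items()) / total_weight
    # The trace keeps one entry per criterion so the result can be audited.
    trace.append((depth, name, round(value, 3)))
    return value


# Hypothetical criteria extracted from a planning document.
option_a = {
    "safety": (0.5, {"fire": (0.6, 0.9), "flood": (0.4, 0.5)}),
    "cost": (0.3, 0.4),
    "schedule": (0.2, 0.8),
}
trace: Trace = []
overall = evaluate("option A", option_a, trace)
```

Here the safety sub-score is 0.6 * 0.9 + 0.4 * 0.5 = 0.74 and the overall score is 0.5 * 0.74 + 0.3 * 0.4 + 0.2 * 0.8 = 0.65. The `trace` list holds each intermediate value, which is the kind of record a structured report could cite.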
RoleRAG: Enhancing LLM Role-Playing via Graph Guided Retrieval
Abstract
arXiv:2505.18541v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown promise in character imitation, enabling immersive and engaging conversations. However, they often generate content that is irrelevant or inconsistent with a character's background. We attribute these failures to: (1) the inability to accurately recall character-specific knowledge due to entity ambiguity, and (2) a lack of awareness of the character's cognitive boundaries. To address these issues, we propose RoleRAG, a retrieval-based framework that integrates efficient entity disambiguation for knowledge indexing with a boundary-aware retriever for extracting contextually appropriate information from a structured knowledge graph. Experiments on role-playing benchmarks show that RoleRAG's calibrated retrieval helps both general-purpose and role-specific LLMs better align with character knowledge and reduce hallucinated responses.
摘要
大语言模型(LLMs)在角色模仿方面展现出潜力,能够实现沉浸式且引人入胜的对话。然而,其生成内容常出现与角色背景无关或不一致的问题。我们将这些缺陷归因于:(1)因实体歧义而无法准确回忆角色特定知识;(2)缺乏对角色认知边界的意识。为解决这些问题,我们提出RoleRAG框架,该检索增强方法结合了高效实体消歧的知识索引技术,以及边界感知检索器,用于从结构化知识图谱中提取符合上下文的信息。角色扮演基准测试表明,RoleRAG的校准检索机制能帮助通用型和角色专用型LLMs更好地对齐角色知识,并减少幻觉响应。
Seeing Beyond Words: MatVQA for Challenging Visual-Scientific Reasoning in Materials Science
Abstract
arXiv:2505.18319v1 Announce Type: new Abstract: The emergence of Multimodal Large Language Models (MLLMs) that integrate vision and language modalities has unlocked new potential for scientific reasoning, outperforming prior benchmarks in both natural language and coding domains. Current materials science evaluation datasets such as MaScQA and SciQA remain largely text-based and fail to capture the visual and research-level analytic complexity required in materials discovery and design. We introduce MatVQA, a scalable benchmark specifically designed to address this gap. Generated via an automated pipeline, MArxivAgent, from recent materials literature, MatVQA features 1325 questions across four critical structure-property-performance (SPP) reasoning tasks. Uniquely, MatVQA employs an iterative process to eliminate textual shortcuts, compelling MLLMs to perform fine-grained, low-level visual analysis of material imagery (e.g., microscopy, diffraction patterns) integrated with multi-step scientific reasoning. Benchmarking 17 open- and closed-source MLLMs on MatVQA reveals substantial gaps in current multimodal reasoning capabilities. MatVQA benchmark data, along with evaluation code, is publicly available at https://anonymous.4open.science/r/matvqa-1E01 to catalyze further research in applying MLLMs to complex materials science problems.
摘要
多模态大语言模型(MLLMs)通过整合视觉与语言模态,在科学推理领域展现出新的潜力,其表现已超越自然语言和编程领域的既有基准。当前材料科学评估数据集(如MaScQA和SciQA)仍主要基于文本,未能涵盖材料发现与设计所需的视觉信息及研究级分析复杂度。为此,我们提出MatVQA——一个专为解决此问题设计的可扩展基准。该数据集通过自动化流程MArxivAgent从最新材料学文献生成,包含1325个问题,覆盖四种关键的结构-性能-功能(SPP)推理任务。MatVQA采用迭代流程消除文本捷径,迫使MLLMs对材料图像(如显微图像、衍射图谱)进行细粒度底层视觉分析,并与多步骤科学推理相结合。对17个开源与闭源MLLMs的基准测试揭示了当前多模态推理能力的显著不足。MatVQA基准数据及评估代码已公开于https://anonymous.4open.science/r/matvqa-1E01,以推动MLLMs在复杂材料科学问题中的应用研究。
Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control
Abstract
arXiv:2505.18279v1 Announce Type: new Abstract: Complex tasks are increasingly delegated to ensembles of specialized LLM-based agents that reason, communicate, and coordinate actions, both among themselves and through interactions with external tools, APIs, and databases. While persistent memory has been shown to enhance single-agent performance, most approaches assume a monolithic, single-user context, overlooking the benefits and challenges of knowledge transfer across users under dynamic, asymmetric permissions. We introduce Collaborative Memory, a framework for multi-user, multi-agent environments with asymmetric, time-evolving access controls encoded as bipartite graphs linking users, agents, and resources. Our system maintains two memory tiers: (1) private memory: private fragments visible only to their originating user; and (2) shared memory: selectively shared fragments. Each fragment carries immutable provenance attributes (contributing agents, accessed resources, and timestamps) to support retrospective permission checks. Granular read policies enforce current user-agent-resource constraints and project existing memory fragments into filtered transformed views. Write policies determine fragment retention and sharing, applying context-aware transformations to update the memory. Both policies may be designed conditioned on system, agent, and user-level information. Our framework enables safe, efficient, and interpretable cross-user knowledge sharing, with provable adherence to asymmetric, time-varying policies and full auditability of memory operations.
摘要
复杂任务正越来越多地委托给由专业化基于大语言模型的智能体组成的协作系统,这些智能体能够进行推理、通信和协调行动——既通过彼此间的交互,也通过与外部工具、API及数据库的互动。虽然持久性记忆已被证明能提升单智能体性能,但现有方法大多基于单一用户场景下的单体架构,忽视了动态非对称权限环境下跨用户知识迁移的效益与挑战。我们提出协作记忆框架,这是一种适用于多用户多智能体环境的解决方案,其通过二分图编码用户、智能体与资源之间非对称且随时间演化的访问控制关系。该系统维护双层记忆结构:(1)仅对创建者可见的私有记忆片段;(2)选择性共享的公共记忆片段。每个片段均携带不可篡改的溯源属性(贡献智能体、访问资源及时间戳)以支持追溯式权限校验。细粒度读取策略强制执行当前用户-智能体-资源约束,并将现有记忆片段投影为经过筛选的转换视图。写入策略通过上下文感知的转换操作决定片段的保留与共享方式。这两类策略均可基于系统、智能体及用户层级信息进行定制设计。本框架实现了安全高效且可解释的跨用户知识共享,可证明地遵循非对称时变策略,并确保所有记忆操作具备完全可审计性。
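The Collaborative Memory entry above describes fragments with immutable provenance, two memory tiers, and read policies checked against the current access graphs. A minimal sketch of that read path follows. It assumes two bipartite edge sets (user-agent and agent-resource) and a subset test on provenance; the class names, fields, and the exact rule are illustrative guesses, not the paper's specification.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Fragment:
    """A memory fragment whose provenance cannot be changed after creation."""
    text: str
    owner: str
    agents: FrozenSet[str]      # agents that contributed to the fragment
    resources: FrozenSet[str]   # resources accessed while producing it
    timestamp: int
    shared: bool = False


@dataclass
class AccessGraphs:
    """Current permissions as two bipartite edge sets (an assumed encoding)."""
    user_agent: Set[Tuple[str, str]]
    agent_resource: Set[Tuple[str, str]]

    def agents_for(self, user: str) -> Set[str]:
        return {a for u, a in self.user_agent if u == user}

    def resources_for(self, user: str) -> Set[str]:
        agents = self.agents_for(user)
        return {r for a, r in self.agent_resource if a in agents}


def read_view(user: str, memory: List[Fragment], graphs: AccessGraphs) -> List[Fragment]:
    """Filter memory down to what the user may see under current permissions."""
    allowed_agents = graphs.agents_for(user)
    allowed_resources = graphs.resources_for(user)
    view = []
    for fragment in memory:
        if not fragment.shared:
            # Private tier: visible only to the originating user.
            if fragment.owner == user:
                view.append(fragment)
        elif fragment.agents <= allowed_agents and fragment.resources <= allowed_resources:
            # Shared tier: every provenance entry must be reachable right now.
            view.append(fragment)
    return view


memory = [
    Fragment("alice's private note", "alice", frozenset({"a1"}), frozenset({"db"}), 1),
    Fragment("shared finding", "alice", frozenset({"a1"}), frozenset({"db"}), 2, shared=True),
]
graphs = AccessGraphs(user_agent={("alice", "a1"), ("bob", "a1")},
                      agent_resource={("a1", "db")})
```

Because the check runs at read time against the live graphs, revoking Bob's edge to agent `a1` immediately hides the shared fragment from him, without rewriting the stored memory.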
Generative RLHF-V: Learning Principles from Multi-modal Human Preference
Abstract
arXiv:2505.18531v1 Announce Type: new Abstract: Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores; and RL optimization from grouped comparison, which enhances multi-modal RL scoring precision through grouped response comparison. Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, while the baseline RLHF improves it by only 5.3%. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses. Our code and models can be found at https://generative-rlhf-v.github.io.
摘要
训练符合人类意图的多模态大语言模型(MLLMs)是一项长期挑战。传统基于单一分数的奖励模型在对齐任务中存在准确率低、泛化能力弱和可解释性差等问题,阻碍了强化学习人类反馈(RLHF)等对齐方法的进展。生成式奖励模型(GRMs)利用MLLMs固有的推理能力判别成对响应,但其成对范式难以推广至可学习的奖励机制。本文提出Generative RLHF-V——一个将GRMs与多模态RLHF相结合的新型对齐框架,采用两阶段流程:基于强化学习的多模态生成式奖励建模,通过强化学习引导GRMs主动捕捉人类意图并预测成对分数;以及基于分组比较的强化学习优化,通过响应分组比较提升多模态强化学习的评分精度。实验结果表明,除奖励模型的分布外泛化能力外,本框架在7个基准测试中将4种MLLMs性能提升18.1%,而基线RLHF仅提升5.3%。进一步验证表明,随着候选响应数量增加,Generative RLHF-V可实现近线性改进。代码与模型详见https://generative-rlhf-v.github.io。
Knowledge Grafting of Large Language Models
Abstract
arXiv:2505.18502v1 Announce Type: new Abstract: Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model's intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model capabilities in a target model with SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy to compress parameter updates, ensuring efficient storage while maintaining task-specific knowledge. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for heterogeneous model fusion and continual learning. Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer. The code is publicly available at: https://github.com/duguodong7/GraftLLM.
摘要
跨能力迁移是大型语言模型(LLM)研究中的关键挑战,其应用涵盖多任务集成、模型压缩和持续学习等领域。近期FuseLLM和FuseChat等研究表明,将多个模型能力迁移至轻量级模型可显著提升适应性与效率,这促使我们探索更高效的跨能力迁移方法。然而现有方法主要针对小型同构模型,限制了其适用性。对于大型异构模型,基于全参数微调的知识蒸馏往往忽视学生模型的固有容量并存在灾难性遗忘风险,而参数高效微调(PEFT)方法则难以有效吸收源LLM的知识。为此,我们提出GraftLLM——一种通过SkillPack格式将源模型能力存储至目标模型的新方法。该方法能保留通用能力、减少参数冲突,并支持无遗忘持续学习与模型融合。我们采用模块感知的自适应压缩策略对参数更新进行压缩,在保持任务特定知识的同时实现高效存储。生成的SkillPack可作为紧凑可迁移的知识载体,特别适用于异构模型融合与持续学习。多场景实验表明,GraftLLM在知识迁移、知识融合和无遗忘学习方面均优于现有技术,为跨能力迁移提供了可扩展的高效解决方案。代码已开源:https://github.com/duguodong7/GraftLLM。
PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning
Abstract
arXiv:2505.18563v1 Announce Type: new Abstract: Large-scale deep neural networks (DNN) exhibit excellent performance for various tasks. As DNNs and datasets grow, distributed training becomes extremely time-consuming and demands larger clusters. A main bottleneck is the resulting gradient aggregation overhead. While gradient compression and sparse collective communication techniques are commonly employed to alleviate network load, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. This paper introduces PacTrain, a novel framework that accelerates distributed training by combining pruning with sparse gradient compression. Active pruning of the neural network makes the model weights and gradients sparse. By ensuring the global knowledge of the gradient sparsity among all distributed training workers, we can perform lightweight compression communication without harming accuracy. We show that the PacTrain compression scheme achieves a near-optimal compression strategy while remaining compatible with the all-reduce primitive. Experimental evaluations show that PacTrain improves training throughput by 1.25 to 8.72 times compared to state-of-the-art compression-enabled systems for representative vision and language models training tasks under bandwidth-constrained conditions.
摘要
大规模深度神经网络(DNN)在各种任务中展现出卓越性能。随着DNN和数据集规模的增长,分布式训练变得极其耗时且需要更大规模的集群。主要瓶颈在于梯度聚合带来的通信开销。虽然梯度压缩和稀疏集体通信技术常被用于缓解网络负载,但多数梯度压缩方案无法在保持精度的同时加速训练过程。本文提出PacTrain框架,通过将剪枝与稀疏梯度压缩相结合来加速分布式训练。神经网络的主动剪枝使得模型权重和梯度具有稀疏性。通过确保所有分布式训练节点共享梯度稀疏性的全局认知,我们可以在不影响精度的情况下实现轻量级压缩通信。研究表明,PacTrain压缩方案实现了接近最优的压缩策略,同时保持与全归约原语的兼容性。实验评估表明,在带宽受限条件下,针对典型视觉和语言模型训练任务,PacTrain相比最先进的压缩系统将训练吞吐量提升了1.25至8.72倍。
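The PacTrain entry above relies on every worker knowing the same gradient sparsity pattern, so that compressed gradients stay compatible with all-reduce. The sketch below illustrates why a shared pruning mask helps: each worker packs only the surviving entries into a dense buffer of identical shape, the buffers are summed element-wise, and the result is scattered back. It is a pure-Python toy with a stand-in `all_reduce_sum`; the function names and data are assumptions, not PacTrain's implementation.

```python
from typing import List


def compress(grad: List[float], mask: List[bool]) -> List[float]:
    """Keep only the gradient entries whose weights survived pruning."""
    return [g for g, keep in zip(grad, mask) if keep]


def all_reduce_sum(buffers: List[List[float]]) -> List[float]:
    """Stand-in for all-reduce: element-wise sum of equally shaped buffers."""
    return [sum(values) for values in zip(*buffers)]


def decompress(packed: List[float], mask: List[bool]) -> List[float]:
    """Scatter reduced values back to full size; pruned positions stay zero."""
    values = iter(packed)
    return [next(values) if keep else 0.0 for keep in mask]


# The same mask on every worker means packed buffers line up index by index,
# so no per-worker index lists need to be exchanged.
mask = [True, False, True, False]
worker_grads = [
    [1.0, 0.0, 2.0, 0.0],
    [0.5, 0.0, -1.0, 0.0],
]
packed = [compress(grad, mask) for grad in worker_grads]
reduced = decompress(all_reduce_sum(packed), mask)
```

In this example each worker sends two values instead of four. Schemes where every worker picks its own top-k entries generally produce mismatched indices and cannot be summed this directly, which is the contrast the shared-mask design avoids.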
Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions
Abstract
arXiv:2505.18492v1 Announce Type: new Abstract: Mathematical reasoning lies at the heart of artificial intelligence, underpinning applications in education, program verification, and research-level mathematical discovery. Mathematical competitions, in particular, present two challenging problem types: theorem-proving, requiring rigorous proofs of stated conclusions, and answer-construction, involving hypothesizing and formally verifying mathematical objects. Large Language Models (LLMs) effectively generate creative candidate answers but struggle with formal verification, while symbolic provers ensure rigor but cannot efficiently handle creative conjecture generation. We introduce the Enumerate-Conjecture-Prove (ECP) framework, a modular neuro-symbolic method integrating LLM-based enumeration and pattern-driven conjecturing with formal theorem proving. We present ConstructiveBench, a dataset of 3,431 answer-construction problems in various math competitions with verified Lean formalizations. On the ConstructiveBench dataset, ECP improves the accuracy of answer construction from the Chain-of-Thought (CoT) baseline of 14.54% to 45.06% with the gpt-4.1-mini model. Moreover, combining with ECP's constructed answers, the state-of-the-art DeepSeek-Prover-V2-7B model generates correct proofs for 858 of the 3,431 constructive problems in Lean, achieving 25.01% accuracy, compared to 9.86% for symbolic-only baselines. Our code and dataset are publicly available at GitHub and HuggingFace, respectively.
摘要
数学推理是人工智能的核心基础,支撑着教育、程序验证和研究级数学发现等应用领域。数学竞赛尤其呈现出两类具有挑战性的问题类型:定理证明(要求对既定结论进行严格证明)和答案构建(涉及数学对象的假设与形式化验证)。大语言模型(LLMs)能有效生成创造性候选答案,但在形式化验证方面存在不足;而符号证明器虽能确保严谨性,却无法高效处理创造性猜想生成。我们提出枚举-猜想-证明(ECP)框架,这是一种模块化神经符号方法,整合了基于LLM的枚举、模式驱动猜想与形式化定理证明。我们构建了ConstructiveBench数据集,包含3,431道各类数学竞赛中的答案构建问题,并配有经过验证的Lean形式化代码。在ConstructiveBench数据集上,ECP框架将答案构建的准确率从思维链(CoT)基线的14.54%提升至45.06%(使用gpt-4.1-mini模型)。此外,结合ECP构建的答案,最先进的DeepSeek-Prover-V2-7B模型为3,431道构造性问题中的858道生成了正确的Lean证明,准确率达25.01%,而纯符号基线的准确率仅为9.86%。我们的代码和数据集已分别在GitHub和HuggingFace平台公开。
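The Enumerate-Conjecture-Prove entry above names three stages. The toy below walks through them on a textbook problem, finding a closed form for the sum of the first n odd numbers. It is only a sketch of the control flow: the candidate formulas are hand-written rather than proposed by an LLM, and the final stage is a bounded numerical check standing in for a formal Lean proof, which it does not replace. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def enumerate_cases(f: Callable[[int], int], upto: int) -> List[int]:
    """Enumerate: compute the answer by brute force for small inputs."""
    return [f(n) for n in range(1, upto + 1)]


def conjecture(values: List[int],
               candidates: Dict[str, Callable[[int], int]]) -> Optional[str]:
    """Conjecture: return the first candidate formula that fits every case."""
    for name, formula in candidates.items():
        if all(formula(n) == v for n, v in enumerate(values, start=1)):
            return name
    return None


def bounded_check(f: Callable[[int], int], g: Callable[[int], int], upto: int) -> bool:
    """Stand-in for the prove stage: compare on many inputs (not a proof)."""
    return all(f(n) == g(n) for n in range(1, upto + 1))


def sum_of_odds(n: int) -> int:
    return sum(2 * k - 1 for k in range(1, n + 1))


# Hand-written candidates; an ECP-style system would generate these.
candidates = {
    "n*(n+1)//2": lambda n: n * (n + 1) // 2,
    "n**2": lambda n: n ** 2,
    "2*n - 1": lambda n: 2 * n - 1,
}
values = enumerate_cases(sum_of_odds, 6)
guess = conjecture(values, candidates)
```

The enumerated values are 1, 4, 9, 16, 25, 36, which rule out the triangular-number candidate at n = 2 and select `n**2`. A formal pipeline would then hand that conjecture to a theorem prover instead of calling `bounded_check`.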
MASTER: Multi-Agent Security Through Exploration of Roles and Topological Structures -- A Comprehensive Framework
Abstract
arXiv:2505.18572v1 Announce Type: new Abstract: Large Language Model (LLM)-based Multi-Agent Systems (MAS) exhibit remarkable problem-solving and task planning capabilities across diverse domains due to their specialized agentic roles and collaborative interactions. However, this also amplifies the severity of security risks under MAS attacks. To address this, we introduce MASTER, a novel security research framework for MAS, focusing on diverse Role configurations and Topological structures across various scenarios. MASTER offers an automated construction process for different MAS setups and an information-flow-based interaction paradigm. To tackle MAS security challenges in varied scenarios, we design a scenario-adaptive, extensible attack strategy utilizing role and topological information, which dynamically allocates targeted, domain-specific attack tasks for collaborative agent execution. Our experiments demonstrate that such an attack, leveraging role and topological information, exhibits significant destructive potential across most models. Additionally, we propose corresponding defense strategies, substantially enhancing MAS resilience across diverse scenarios. We anticipate that our framework and findings will provide valuable insights for future research into MAS security challenges.
摘要
基于大语言模型(LLM)的多智能体系统(MAS)凭借其专业化的智能体角色与协同交互机制,在跨领域问题解决和任务规划方面展现出卓越能力。然而这也使得系统在遭受MAS攻击时的安全风险严重性被放大。为此,我们提出MASTER——一个面向多智能体系统安全研究的新型框架,重点关注不同场景下的角色配置与拓扑结构。该框架提供自动化构建多样化MAS配置的流程,以及基于信息流的交互范式。针对多场景下的MAS安全挑战,我们设计了一种利用角色与拓扑信息的场景自适应可扩展攻击策略,能够动态分配针对特定领域的目标攻击任务供智能体协作执行。实验表明,此类利用角色与拓扑信息的攻击对多数模型均具有显著破坏潜力。此外,我们提出了相应防御策略,可显著提升多场景下MAS的韧性。我们期待该框架及研究发现能为未来MAS安全挑战研究提供重要启示。
Response Uncertainty and Probe Modeling: Two Sides of the Same Coin in LLM Interpretability?
Abstract
arXiv:2505.18575v1 Announce Type: new Abstract: Probing techniques have shown promise in revealing how LLMs encode human-interpretable concepts, particularly when applied to curated datasets. However, the factors governing a dataset's suitability for effective probe training are not well-understood. This study hypothesizes that probe performance on such datasets reflects characteristics of both the LLM's generated responses and its internal feature space. Through quantitative analysis of probe performance and LLM response uncertainty across a series of tasks, we find a strong correlation: improved probe performance consistently corresponds to a reduction in response uncertainty, and vice versa. Subsequently, we delve deeper into this correlation through the lens of feature importance analysis. Our findings indicate that high LLM response variance is associated with a larger set of important features, which poses a greater challenge for probe models and often results in diminished performance. Moreover, leveraging the insights from response uncertainty analysis, we are able to identify concrete examples where LLM representations align with human knowledge across diverse domains, offering additional evidence of interpretable reasoning in LLMs.
摘要
探测技术在揭示大语言模型如何编码人类可解释概念方面展现出潜力,尤其在应用于精选数据集时表现突出。然而,目前对决定数据集是否适合有效训练探测器的因素仍缺乏深入理解。本研究提出假设:此类数据集上的探测器性能同时反映了大语言模型生成响应及其内部特征空间的特性。通过对系列任务中探测器性能与模型响应不确定性的定量分析,我们发现存在强相关性:探测器性能提升始终伴随响应不确定性的降低,反之亦然。随后,我们通过特征重要性分析的视角深入探究这种关联。研究结果表明,高模型响应方差与更庞大的重要特征集合相关,这为探测模型带来了更大挑战并通常导致性能下降。此外,基于响应不确定性分析的发现,我们能够识别出大语言模型表征与多领域人类知识相吻合的具体案例,这为模型可解释推理提供了新的实证依据。
RvLLM: LLM Runtime Verification with Domain Knowledge
Abstract
arXiv:2505.18585v1 Announce Type: new Abstract: Large language models (LLMs) have emerged as a dominant AI paradigm due to their exceptional text understanding and generation capabilities. However, their tendency to generate inconsistent or erroneous outputs challenges their reliability, especially in high-stakes domains requiring accuracy and trustworthiness. Existing research primarily focuses on detecting and mitigating model misbehavior in general-purpose scenarios, often overlooking the potential of integrating domain-specific knowledge. In this work, we advance misbehavior detection by incorporating domain knowledge. The core idea is to design a general specification language that enables domain experts to customize domain-specific predicates in a lightweight and intuitive manner, supporting later runtime verification of LLM outputs. To achieve this, we design a novel specification language, ESL, and introduce a runtime verification framework, RvLLM, to validate LLM output against domain-specific constraints defined in ESL. We evaluate RvLLM on three representative tasks: violation detection against the Singapore Rapid Transit Systems Act, numerical comparison, and inequality solving. Experimental results demonstrate that RvLLM effectively detects erroneous outputs across various LLMs in a lightweight and flexible manner. The results reveal that despite their impressive capabilities, LLMs remain prone to low-level errors due to limited interpretability and a lack of formal guarantees during inference, and our framework offers a potential long-term solution by leveraging expert domain knowledge to rigorously and efficiently verify LLM outputs.
摘要
大型语言模型(LLMs)因其卓越的文本理解和生成能力,已成为人工智能领域的主导范式。然而,其生成不一致或错误输出的倾向对可靠性提出了挑战,尤其是在需要精确性和可信度的高风险领域。现有研究主要集中于检测和缓解通用场景中的模型错误行为,往往忽视了整合领域特定知识的潜力。本研究通过融入领域知识,推进了错误行为检测。核心思想是设计一种通用规范语言,使领域专家能够以轻量级且直观的方式定制领域特定谓词,从而支持后续对LLM输出的运行时验证。为此,我们设计了一种新颖的规范语言ESL,并引入了一个运行时验证框架RvLLM,用于根据ESL中定义的领域特定约束验证LLM输出。我们在三个代表性任务上评估了RvLLM:新加坡快速交通系统法案违规检测、数值比较和不等式求解。实验结果表明,RvLLM以轻量级且灵活的方式有效检测了各种LLM的错误输出。结果揭示,尽管LLM能力出众,但由于推理过程中可解释性有限且缺乏形式化保证,它们仍容易犯低级错误,而我们的框架通过利用专家领域知识严格高效地验证LLM输出,提供了潜在的长期解决方案。
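A minimal sketch of the runtime-verification idea on the numerical comparison task mentioned in the abstract above. The abstract does not show ESL syntax, so nothing here is ESL: the rule is written as a plain Python regular expression and comparator table, and `find_violations` is a hypothetical name introduced only for this illustration.

```python
import re
from decimal import Decimal

# Assumed rule format: extract claims such as "9.11 > 9.9" from model output.
CLAIM_PATTERN = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*(<=|>=|<|>|=)\s*(-?\d+(?:\.\d+)?)"
)

# The domain knowledge being enforced: ordinary ordering of decimal numbers.
COMPARATORS = {
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
    "<=": lambda a, b: a <= b,
    ">=": lambda a, b: a >= b,
    "=": lambda a, b: a == b,
}


def find_violations(llm_output):
    """Return every numeric comparison in the text that is actually false."""
    violations = []
    for match in CLAIM_PATTERN.finditer(llm_output):
        left, operator, right = match.groups()
        # Decimal avoids binary floating-point surprises when comparing.
        if not COMPARATORS[operator](Decimal(left), Decimal(right)):
            violations.append(match.group(0))
    return violations


print(find_violations("Since 9.11 > 9.9, version 9.11 is newer."))
print(find_violations("We have 3.5 < 4 and 2 >= 2."))
```

The checker runs after generation and flags only the claims it can parse, which mirrors the lightweight, after-the-fact character of the verification described in the abstract.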
LLMs for Supply Chain Management
Abstract
arXiv:2505.18597v1 Announce Type: new Abstract: The development of large language models (LLMs) has provided new tools for research in supply chain management (SCM). In this paper, we introduce a retrieval-augmented generation (RAG) framework that dynamically integrates external knowledge into the inference process, and develop a domain-specialized SCM LLM, which demonstrates expert-level competence by passing standardized SCM examinations and beer game tests. We further use LLMs to conduct horizontal and vertical supply chain games in order to analyze competition and cooperation within supply chains. Our experiments show that RAG significantly improves performance on SCM tasks. Moreover, game-theoretic analysis reveals that the LLM can reproduce insights from the classical SCM literature, while also uncovering novel behaviors and offering fresh perspectives on phenomena such as the bullwhip effect. This paper opens the door to exploring cooperation and competition in complex supply chain networks through the lens of LLMs.
摘要
大型语言模型(LLM)的发展为供应链管理(SCM)研究提供了新工具。本文提出一种检索增强生成(RAG)框架,该框架能将外部知识动态整合至推理过程,并开发出具备领域专业性的SCM-LLM模型。该模型通过标准化SCM考试和啤酒游戏测试,展现出专家级能力。我们进一步运用LLM开展横向与纵向供应链博弈,以分析供应链内的竞争与合作。实验表明RAG能显著提升SCM任务表现。博弈论分析揭示该LLM既能复现经典SCM文献的洞见,又能发现新行为,为牛鞭效应等现象提供新视角。本研究为通过LLM探索复杂供应链网络的合作与竞争机制开辟了新途径。
Knowledge Retrieval in LLM Gaming: A Shift from Entity-Centric to Goal-Oriented Graphs
Abstract
arXiv:2505.18607v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate impressive general capabilities but often struggle with step-by-step reasoning, especially in complex applications such as games. While retrieval-augmented methods like GraphRAG attempt to bridge this gap through cross-document extraction and indexing, their fragmented entity-relation graphs and overly dense local connectivity hinder the construction of coherent reasoning. In this paper, we propose a novel framework based on Goal-Oriented Graphs (GoGs), where each node represents a goal and its associated attributes, and edges encode logical dependencies between goals. This structure enables explicit retrieval of reasoning paths by first identifying high-level goals and recursively retrieving their subgoals, forming coherent reasoning chains to guide LLM prompting. Our method significantly enhances the reasoning ability of LLMs in game-playing tasks, as demonstrated by extensive experiments on the Minecraft testbed, outperforming GraphRAG and other baselines.
摘要
大型语言模型(LLMs)展现出卓越的通用能力,但在逐步推理任务中常面临困难,尤其在游戏等复杂应用场景。尽管GraphRAG等基于检索增强的方法尝试通过跨文档信息提取与索引来弥补这一缺陷,但其碎片化的实体-关系图和过度稠密的局部连接阻碍了连贯推理链的构建。本文提出一种基于目标导向图(GoGs)的新型框架:节点表示目标及其关联属性,边编码目标间的逻辑依赖关系。该结构通过先识别高层目标、再递归检索子目标的方式实现推理路径的显式检索,从而形成连贯的推理链以指导LLM提示。在Minecraft测试平台上的大量实验表明,本方法显著提升了LLMs在游戏任务中的推理能力,其表现优于GraphRAG及其他基线模型。
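A small sketch of the retrieval pattern the abstract above describes: start from a high-level goal and recursively pull in its subgoals to form an ordered reasoning chain. The graph contents and the function name are assumptions made up for illustration; they are not taken from the paper or from any Minecraft dataset.

```python
# Hypothetical goal-oriented graph: each goal maps to the goals it depends on.
GOAL_GRAPH = {
    "craft_iron_pickaxe": ["smelt_iron_ingot", "craft_stick"],
    "smelt_iron_ingot": ["mine_iron_ore", "craft_furnace"],
    "mine_iron_ore": ["craft_stone_pickaxe"],
    "craft_furnace": ["mine_cobblestone"],
    "craft_stone_pickaxe": ["mine_cobblestone", "craft_stick"],
    "mine_cobblestone": ["craft_wooden_pickaxe"],
    "craft_wooden_pickaxe": ["craft_planks", "craft_stick"],
    "craft_stick": ["craft_planks"],
    "craft_planks": ["collect_log"],
    "collect_log": [],
}


def retrieve_reasoning_chain(graph, goal, visited=None):
    """Return the goal and all of its subgoals, dependencies first."""
    if visited is None:
        visited = set()
    if goal in visited:
        return []  # already scheduled earlier in the chain
    visited.add(goal)
    chain = []
    for subgoal in graph.get(goal, []):
        chain.extend(retrieve_reasoning_chain(graph, subgoal, visited))
    chain.append(goal)  # a goal comes after everything it depends on
    return chain


chain = retrieve_reasoning_chain(GOAL_GRAPH, "craft_iron_pickaxe")
prompt_context = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(chain))
print(prompt_context)
```

The numbered `prompt_context` string is one plausible way to hand the chain to an LLM prompt; how the paper actually formats retrieved paths is not stated in the abstract.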
AI for Regulatory Affairs: Balancing Accuracy, Interpretability, and Computational Cost in Medical Device Classification
Abstract
arXiv:2505.18695v1 Announce Type: new Abstract: Regulatory affairs, which sits at the intersection of medicine and law, can benefit significantly from AI-enabled automation. Classification is the initial step, in which manufacturers position their products to regulatory authorities, and it plays a critical role in determining market access, regulatory scrutiny, and ultimately, patient safety. In this study, we investigate a broad range of AI models -- including traditional machine learning (ML) algorithms, deep learning architectures, and large language models -- using a regulatory dataset of medical device descriptions. We evaluate each model along three key dimensions: accuracy, interpretability, and computational cost.
摘要
作为医药学与法学交叉领域的监管事务,可从人工智能驱动的自动化中显著获益。分类任务是制造商向监管机构申报产品定位的首要环节,对市场准入、监管审查乃至患者安全具有决定性作用。本研究基于医疗器械描述数据集,系统考察了传统机器学习算法、深度学习架构及大语言模型在内的多种人工智能模型。我们从准确性、可解释性和计算成本三个关键维度对各类模型进行了全面评估。
Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning
Abstract
arXiv:2505.18603v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have made significant progress in document understanding. However, the information-dense nature of document images still poses challenges, as most queries depend on only a few relevant regions, with the rest being redundant. Existing one-pass MLLMs process entire document images without considering query relevance, often failing to focus on critical regions and producing unfaithful responses. Inspired by the human coarse-to-fine reading pattern, we introduce Doc-CoB (Chain-of-Box), a simple-yet-effective mechanism that integrates human-style visual reasoning into MLLM without modifying its architecture. Our method allows the model to autonomously select the set of regions (boxes) most relevant to the query, and then focus attention on them for further understanding. We first design a fully automatic pipeline, integrating a commercial MLLM with a layout analyzer, to generate 249k training samples with intermediate visual reasoning supervision. Then we incorporate two enabling tasks that improve box identification and box-query reasoning, which together enhance document understanding. Extensive experiments on seven benchmarks with four popular models show that Doc-CoB significantly improves performance, demonstrating its effectiveness and wide applicability. All code, data, and models will be released publicly.
摘要
多模态大语言模型(MLLMs)在文档理解领域取得了显著进展。然而,文档图像信息密集的特性仍带来挑战,因为大多数查询仅依赖于少数相关区域,其余部分则冗余。现有的一阶段MLLMs在未考虑查询相关性的情况下处理整个文档图像,往往难以聚焦关键区域并产生不可靠的响应。受人类由粗到细阅读模式的启发,我们提出了Doc-CoB(链式框选)机制,这一简洁高效的方案在不修改模型架构的前提下,将类人视觉推理能力融入MLLM。该方法使模型能自主选择与查询最相关的区域(框)集合,进而集中注意力进行深度理解。我们首先设计了一个全自动流程,将商用MLLM与布局分析器结合,生成24.9万条带有中间视觉推理监督的训练样本。随后引入两项赋能任务以提升框选识别和框-查询推理能力,共同增强文档理解性能。在四个主流模型上对七个基准测试的广泛实验表明,Doc-CoB显著提升了性能,验证了其有效性和广泛适用性。所有代码、数据及模型将公开释放。
AI-Researcher: Autonomous Scientific Innovation
Abstract
arXiv:2505.18705v1 Announce Type: new Abstract: The powerful reasoning capabilities of Large Language Models (LLMs) in mathematics and coding, combined with their ability to automate complex tasks through agentic frameworks, present unprecedented opportunities for accelerating scientific innovation. In this paper, we introduce AI-Researcher, a fully autonomous research system that transforms how AI-driven scientific discovery is conducted and evaluated. Our framework seamlessly orchestrates the complete research pipeline--from literature review and hypothesis generation to algorithm implementation and publication-ready manuscript preparation--with minimal human intervention. To rigorously assess autonomous research capabilities, we develop Scientist-Bench, a comprehensive benchmark comprising state-of-the-art papers across diverse AI research domains, featuring both guided innovation and open-ended exploration tasks. Through extensive experiments, we demonstrate that AI-Researcher achieves remarkable implementation success rates and produces research papers that approach human-level quality. This work establishes new foundations for autonomous scientific innovation that can complement human researchers by systematically exploring solution spaces beyond cognitive limitations.
摘要
大型语言模型(LLMs)在数学与编程领域强大的推理能力,结合其通过智能体框架自动化复杂任务的特点,为加速科学创新提供了前所未有的机遇。本文提出AI-Researcher——一个彻底变革AI驱动科研工作方式与评估体系的完全自主研究系统。该框架能无缝协调从文献综述、假设生成到算法实现及可发表级论文撰写的完整研究流程,仅需极少量人工干预。为系统评估自主科研能力,我们开发了Scientist-Bench综合基准测试,涵盖多个人工智能研究领域的前沿论文,包含定向创新与开放式探索双重任务。大量实验表明,AI-Researcher不仅实现了显著的实施方案成功率,其产出的研究论文质量更接近人类水平。本研究为突破认知局限、系统性探索解决方案空间的自主科学创新奠定了新基础,未来可与人类研究者形成互补。
MLLMs are Deeply Affected by Modality Bias
Abstract
arXiv:2505.18657v1 Announce Type: new Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images. MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs. This position paper argues that MLLMs are deeply affected by modality bias. Firstly, we diagnose the current state of modality bias, highlighting its manifestations across various tasks. Secondly, we propose a systematic research road-map related to modality bias in MLLMs. Thirdly, we identify key factors of modality bias in MLLMs and offer actionable suggestions for future research to mitigate it. To substantiate these findings, we conduct experiments that demonstrate the influence of each factor: 1. Data Characteristics: Language data is compact and abstract, while visual data is redundant and complex, creating an inherent imbalance in learning dynamics. 2. Imbalanced Backbone Capabilities: The dominance of pretrained language models in MLLMs leads to overreliance on language and neglect of visual information. 3. Training Objectives: Current objectives often fail to promote balanced cross-modal alignment, resulting in shortcut learning biased toward language. These findings highlight the need for balanced training strategies and model architectures to better integrate multiple modalities in MLLMs. We call for interdisciplinary efforts to tackle these challenges and drive innovation in MLLM research. Our work provides a fresh perspective on modality bias in MLLMs and offers insights for developing more robust and generalizable multimodal systems, advancing progress toward Artificial General Intelligence.
摘要
多模态大语言模型(MLLMs)的最新进展在整合文本与图像等多样模态方面展现出显著潜力。然而,MLLMs深受模态偏差影响,往往过度依赖语言模态而忽视视觉输入等其他模态的充分利用。本立场文件论证了MLLMs中存在深层次的模态偏差问题:首先,我们系统诊断了当前模态偏差的表现形式及其在不同任务中的影响;其次,提出针对MLLMs模态偏差的系统性研究路线图;第三,揭示了导致模态偏差的关键因素,并为未来研究提供可操作的缓解建议。通过实验验证,我们证实了以下核心因素的影响机制:1. 数据特性——语言数据具有紧凑性和抽象性,而视觉数据存在冗余性与复杂性,这种固有差异导致学习动态失衡;2. 骨干能力失衡——预训练语言模型在MLLMs中的主导地位引发对语言模态的过度依赖;3. 训练目标缺陷——现有目标函数难以实现跨模态均衡对齐,导致模型倾向于语言捷径学习。这些发现表明,需要开发均衡的训练策略与模型架构以实现多模态的有效整合。我们呼吁跨学科协作应对这些挑战,推动MLLM研究的创新发展。本研究为理解MLLMs中的模态偏差提供了新视角,并为构建更具鲁棒性和泛化性的多模态系统提供了理论依据,这对推进通用人工智能发展具有重要意义。
AI-Driven Climate Policy Scenario Generation for Sub-Saharan Africa
Abstract
arXiv:2505.18694v1 Announce Type: new Abstract: Climate policy scenario generation and evaluation have traditionally relied on integrated assessment models (IAMs) and expert-driven qualitative analysis. These methods enable stakeholders, such as policymakers and researchers, to anticipate impacts, plan governance strategies, and develop mitigation measures. However, traditional methods are often time-intensive, reliant on simple extrapolations of past trends, and limited in capturing the complex and interconnected nature of energy and climate issues. With the advent of artificial intelligence (AI), particularly generative AI models trained on vast datasets, these limitations can be addressed, ensuring robustness even under limited data conditions. In this work, we explore a novel method that employs generative AI, specifically large language models (LLMs), to simulate climate policy scenarios for Sub-Saharan Africa. These scenarios focus on energy transition themes derived from the historical United Nations Climate Change Conference (COP) documents. By leveraging generative models, the project aims to create plausible and diverse policy scenarios that align with regional climate goals and energy challenges. Given limited access to human evaluators, automated techniques were employed for scenario evaluation. We generated policy scenarios using the llama3.2-3B model. Of the 34 generated responses, 30 (88%) passed expert validation, accurately reflecting the intended impacts provided in the corresponding prompts. We compared these validated responses against assessments from a human climate expert and two additional LLMs (gemma2-2B and mistral-7B). Our structured, embedding-based evaluation framework shows that generative AI can effectively generate scenarios that are coherent, relevant, plausible, and diverse. This approach offers a transformative tool for climate policy planning in data-constrained regions.
摘要
气候政策情景生成与评估传统上依赖于综合评估模型(IAMs)和专家驱动的定性分析。这些方法使政策制定者和研究人员等利益相关者能够预测影响、规划治理策略并制定缓解措施。然而,传统方法通常耗时较长,依赖于对历史趋势的简单外推,且在捕捉能源与气候问题复杂互联性方面存在局限。随着人工智能(AI)的发展,特别是基于海量数据训练的生成式AI模型,这些限制得以解决,即使在数据有限条件下也能确保稳健性。本研究探索了一种创新方法,利用生成式AI(尤其是大语言模型LLMs)模拟撒哈拉以南非洲的气候政策情景。这些情景聚焦于从历届联合国气候变化大会(COP)文件提取的能源转型主题。通过生成模型,该项目旨在创建符合区域气候目标和能源挑战的合理且多样化的政策情景。由于人类评估者资源有限,研究采用自动化技术进行情景评估。我们使用llama3.2-3B模型生成政策情景,在34条生成响应中,30条(88%)通过专家验证,准确反映了对应提示中的预期影响。我们将这些验证响应与人类气候专家及另外两个LLM模型(gemma2-2B和mistral-7B)的评估结果进行对比。基于嵌入的结构化评估框架表明,生成式AI能有效生成连贯、相关、合理且多样化的情景。该方法为数据受限地区的气候政策规划提供了变革性工具。
C3-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking
Abstract
arXiv:2505.18746v1 Announce Type: new Abstract: Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must consider more complex factors, such as inter-tool relationships, environmental feedback and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues. However, it overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present an open-source and high-quality benchmark, C3-Bench. This benchmark integrates attack concepts and applies univariate analysis to pinpoint key elements affecting agent robustness. Concretely, we design three challenges: navigate complex tool relationships, handle critical hidden information and manage dynamic decision paths. Complementing these challenges, we introduce fine-grained metrics, innovative data collection algorithms and reproducible evaluation methods. Extensive experiments are conducted on 49 mainstream agents, encompassing general fast-thinking, slow-thinking and domain-specific models. We observe that agents have significant shortcomings in handling tool dependencies, long context information dependencies and frequent policy-type switching. In essence, C3-Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance. The benchmark is publicly available at https://github.com/yupeijei1997/C3-Bench.
摘要
基于大语言模型的智能体通过工具操作改变环境,正在彻底革新人工智能与物理世界的交互方式。与传统自然语言处理任务仅依赖历史对话生成响应不同,此类智能体在决策时需综合考虑工具间关联性、环境反馈和历史选择等复杂因素。当前研究通常通过多轮对话评估智能体性能,却忽视了这些关键因素对智能体行为的影响。为填补这一空白,我们提出了开源高质量基准测试集C3-Bench。该基准融合攻击概念并采用单变量分析,精准识别影响智能体鲁棒性的关键要素。具体而言,我们设计了三大挑战:复杂工具关系导航、关键隐藏信息处理和动态决策路径管理。配合这些挑战,我们引入了细粒度评估指标、创新的数据收集算法和可复现的评测方法。通过对49个主流智能体(包括通用快思考、慢思考及领域专用模型)的大规模实验,我们发现现有智能体在处理工具依赖性、长上下文信息关联和频繁策略切换方面存在显著缺陷。本质上,C3-Bench旨在通过这些挑战暴露模型弱点,并推动智能体性能可解释性研究。本基准已开源发布:https://github.com/yupeijei1997/C3-Bench。
Mitigating Deceptive Alignment via Self-Monitoring
Abstract
arXiv:2505.18807v1 Announce Type: new Abstract: Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment, that is, situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post-hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer this question with CoT Monitor+, the first framework that embeds a Self-Monitor inside the CoT process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes covert alignment-faking, sycophancy, etc. We evaluate various LLMs and show that unrestricted CoT roughly aggravates the deceptive tendency. In contrast, CoT Monitor+ cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency. Our project website can be found at cot-monitor-plus.github.io
摘要
现代大型语言模型依赖思维链(CoT)推理实现卓越性能,但该机制也可能放大欺骗性对齐现象,即模型表面合规却暗中追求未对齐目标。现有安全方案将欺骗视为需事后过滤的黑箱输出,放任模型在内部推理中持续谋划。我们提出核心问题:能否在模型思考过程中拦截欺骗行为?为此,我们首次提出在CoT流程内部嵌入自我监控框架CoT Monitor+。该框架在生成时同步产生:(i)常规推理步骤;(ii)经训练的内部自评估信号,用于标记并抑制未对齐策略。该信号作为强化学习的辅助奖励,形成促进诚实推理、遏制隐藏目标的反馈循环。为系统研究欺骗性对齐,我们构建DeceptionBench基准测试,涵盖伪装对齐、谄媚行为等五类探测任务。评估表明,无约束CoT会在一定程度上加剧欺骗倾向,而CoT Monitor+在保持任务准确率的同时,将欺骗行为平均削减43.8%。进一步研究发现,当自监控信号替代RL微调中的外部弱评估器时,模型显著减少模糊思维并保持透明度。项目网站见cot-monitor-plus.github.io。
The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
Abstract
arXiv:2505.18759v1 Announce Type: new Abstract: Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark for systematically assessing the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from method, model and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization, and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The dataset can be found at https://huggingface.co/datasets/rana-shahroz/DC-COT, and our code is shared at https://anonymous.4open.science/r/DC-COT-FF4C/.
摘要
以数据为中心的蒸馏方法(包括数据增强、选择和混合)为创建更小、更高效且保持强大推理能力的学生大语言模型(LLM)提供了一条有前景的路径。然而,目前仍缺乏一个全面的基准来系统评估每种蒸馏方法的效果。本文提出了DC-CoT,这是首个从方法、模型和数据角度系统研究思维链(CoT)蒸馏中数据操作的以数据为中心的基准。通过利用多种教师模型(如o4-mini、Gemini-Pro、Claude-3.5)和学生架构(如3B、7B参数),我们严格评估了这些数据操作对学生模型在多个推理数据集上性能的影响,重点关注分布内(IID)和分布外(OOD)泛化能力以及跨领域迁移。我们的研究旨在为通过以数据为中心的技术优化CoT蒸馏提供可操作的见解,并建立最佳实践,最终推动开发更易获取且能力更强的推理模型。
Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations
Abstract
arXiv:2505.18907v1 Announce Type: new Abstract: Prompt injection attacks are a critical security vulnerability in large language models (LLMs), allowing attackers to hijack model behavior by injecting malicious instructions within the input context. Recent defense mechanisms have leveraged an Instruction Hierarchy (IH) Signal, often implemented through special delimiter tokens or additive embeddings to denote the privilege level of input tokens. However, these prior works typically inject the IH signal exclusively at the initial input layer, which we hypothesize limits its ability to effectively distinguish the privilege levels of tokens as it propagates through the different layers of the model. To overcome this limitation, we introduce a novel approach that injects the IH signal into the intermediate token representations within the network. Our method augments these representations with layer-specific trainable embeddings that encode the privilege information. Our evaluations across multiple models and training methods reveal that our proposal yields between 1.6× and 9.2× reduction in attack success rate on gradient-based prompt injection attacks compared to state-of-the-art methods, without significantly degrading the model's utility.
摘要
提示注入攻击是大型语言模型(LLMs)中一种关键的安全漏洞,攻击者通过在输入上下文中注入恶意指令来劫持模型行为。现有防御机制通常采用指令层级(IH)信号,通过特殊分隔符标记或附加嵌入来表示输入标记的权限级别。然而,这些方法通常仅在初始输入层注入IH信号,我们假设这会限制其在模型各层传播过程中有效区分标记权限的能力。为克服这一局限,我们提出一种新方法,将IH信号注入网络中的中间标记表示。该方法通过特定于层的可训练嵌入来增强这些表示,从而编码权限信息。我们在多种模型和训练方法上的评估表明,与现有最优方法相比,该方案在基于梯度的提示注入攻击中实现了攻击成功率降低1.6至9.2倍的效果,且未显著影响模型效用。
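A toy sketch contrasting input-only injection of the instruction hierarchy signal with injection at every layer, as described in the abstract above. Everything here is an assumption for illustration: the "layer" is a per-token scale-and-tanh with no attention, the privilege embeddings are fixed numbers rather than trained parameters, and all names are hypothetical. It only shows that a privilege signal added once fades through the stack, while a layer-specific signal keeps the two tokens apart; it says nothing about attack success rates.

```python
import math


def toy_layer(hidden, scale=0.5):
    """Stand-in for a transformer block: per-token scaling followed by tanh."""
    return [[math.tanh(scale * x) for x in token] for token in hidden]


def make_ih_table(layer_index, num_levels, width):
    """Hypothetical layer-specific privilege embeddings (fixed, not trained)."""
    return [[0.1 * (layer_index + 1) * level] * width for level in range(num_levels)]


def forward(hidden, privilege_levels, num_layers, ih_tables):
    """Run the toy stack, adding ih_tables[layer][level] before each listed layer."""
    for layer_index in range(num_layers):
        table = ih_tables.get(layer_index)
        if table is not None:
            hidden = [
                [x + e for x, e in zip(token, table[level])]
                for token, level in zip(hidden, privilege_levels)
            ]
        hidden = toy_layer(hidden)
    return hidden


WIDTH, NUM_LAYERS, NUM_LEVELS = 4, 3, 3
tokens = [[1.0] * WIDTH, [1.0] * WIDTH]  # identical content
levels = [0, 2]                          # different privilege levels

input_only = {0: make_ih_table(0, NUM_LEVELS, WIDTH)}
every_layer = {i: make_ih_table(i, NUM_LEVELS, WIDTH) for i in range(NUM_LAYERS)}

out_input_only = forward(tokens, levels, NUM_LAYERS, input_only)
out_every_layer = forward(tokens, levels, NUM_LAYERS, every_layer)

# How far apart the two privilege levels remain at the top of the stack.
gap_input_only = abs(out_input_only[1][0] - out_input_only[0][0])
gap_every_layer = abs(out_every_layer[1][0] - out_every_layer[0][0])
print(gap_input_only, gap_every_layer)
```

In this toy, the gap between the two tokens is larger when the signal is re-injected at every layer, which is the intuition behind the abstract's hypothesis about input-only injection.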
LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS
Abstract
arXiv:2505.18829v1 Announce Type: new Abstract: We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems. The source code of LiteCUA is available at https://github.com/agiresearch/LiteCUA, and it is also integrated into the AIOS main branch as part of AIOS at https://github.com/agiresearch/AIOS.
摘要
我们推出AIOS 1.0这一创新平台,旨在通过环境情境化提升计算机使用代理(CUA)的能力。现有方法主要聚焦于构建更强大的代理框架或增强代理模型,但我们发现一个根本性局限:语言模型对世界的理解方式与计算机界面结构之间存在语义断层。AIOS 1.0通过将计算机转化为语言模型可原生理解的情境化环境,采用模型情境协议(MCP)服务器架构来抽象计算机状态与动作,从而有效解决这一挑战。该方法实现了界面复杂度与决策复杂度的解耦,使代理能更高效地推理计算环境。为验证平台效能,我们基于AIOS 1.0开发了轻量级计算机使用代理LiteCUA,其在OSWorld基准测试中取得14.66%的成功率,尽管架构简单却优于多个专用代理框架。研究结果表明,为语言模型构建计算机环境情境化是实现更强大计算机使用代理、推进AI与数字系统交互的重要方向。LiteCUA源代码发布于https://github.com/agiresearch/LiteCUA,并作为AIOS组成部分集成于主分支https://github.com/agiresearch/AIOS。
AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting
Abstract
arXiv:2505.18822v1 Announce Type: new Abstract: Modern large reasoning models demonstrate impressive problem-solving capabilities by employing sophisticated reasoning strategies. However, they often struggle to balance efficiency and effectiveness, frequently generating unnecessarily lengthy reasoning chains for simple problems. In this work, we propose AdaCtrl, a novel framework to support both difficulty-aware adaptive reasoning budget allocation and explicit user control over reasoning depth. AdaCtrl dynamically adjusts its reasoning length based on self-assessed problem difficulty, while also allowing users to manually control the budget to prioritize either efficiency or effectiveness. This is achieved through a two-stage training pipeline: an initial cold-start fine-tuning phase that instills the ability to self-assess difficulty and adjust the reasoning budget, followed by a difficulty-aware reinforcement learning (RL) stage that refines the model's adaptive reasoning strategies and calibrates its difficulty assessments based on its evolving capabilities during online training. To enable intuitive user interaction, we design explicit length-triggered tags that function as a natural interface for budget control. Empirical results show that AdaCtrl adapts reasoning length based on estimated difficulty. Compared to the standard training baseline that also incorporates fine-tuning and RL, it yields performance improvements while reducing response length by 10.06% and 12.14% on the more challenging AIME2024 and AIME2025 datasets, which require elaborate reasoning, and by 62.05% and 91.04% on the MATH500 and GSM8K datasets, where more concise responses are sufficient. Furthermore, AdaCtrl enables precise user control over the reasoning budget, allowing for tailored responses to meet specific needs.
摘要
现代大型推理模型通过采用复杂的推理策略展现出令人印象深刻的问题解决能力。然而,这些模型往往难以平衡效率与效果,经常为简单问题生成不必要的冗长推理链。本研究提出AdaCtrl框架,该创新系统同时支持难度感知的自适应推理预算分配和用户对推理深度的显式控制。AdaCtrl根据自评估的问题难度动态调整推理长度,同时允许用户手动控制预算以优先考虑效率或效果。这一功能通过两阶段训练流程实现:首先是冷启动微调阶段,用于培养模型对难度的自我认知和推理预算调整能力;随后是难度感知强化学习阶段,该阶段优化模型的自适应推理策略,并根据在线训练过程中不断演进的能力校准其难度评估。为实现直观的用户交互,我们设计了显式的长度触发标签作为预算控制的自然界面。实验结果表明,相较于同样包含微调和强化学习的标准训练基线,AdaCtrl能基于预估难度调整推理长度——在需要精细推理的AIME2024和AIME2025数据集上,性能提升的同时分别减少10.06%和12.14%的响应长度;而在更简短响应即可满足需求的MATH500和GSM8K数据集上,缩减幅度分别达到62.05%和91.04%。此外,AdaCtrl实现了对推理预算的精确用户控制,可生成满足特定需求的定制化响应。
Signal, Image, or Symbolic: Exploring the Best Input Representation for Electrocardiogram-Language Models Through a Unified Framework
Abstract
arXiv:2505.18847v1 Announce Type: new Abstract: Recent advances have increasingly applied large language models (LLMs) to electrocardiogram (ECG) interpretation, giving rise to Electrocardiogram-Language Models (ELMs). Conditioned on an ECG and a textual query, an ELM autoregressively generates a free-form textual response. Unlike traditional classification-based systems, ELMs emulate expert cardiac electrophysiologists by issuing diagnoses, analyzing waveform morphology, identifying contributing factors, and proposing patient-specific action plans. To realize this potential, researchers are curating instruction-tuning datasets that pair ECGs with textual dialogues and are training ELMs on these resources. Yet before scaling ELMs further, there is a fundamental question yet to be explored: What is the most effective ECG input representation? In recent works, three candidate representations have emerged: raw time-series signals, rendered images, and discretized symbolic sequences. We present the first comprehensive benchmark of these modalities across 6 public datasets and 5 evaluation metrics. We find symbolic representations achieve the greatest number of statistically significant wins over both signal and image inputs. We further ablate the LLM backbone, ECG duration, and token budget, and we evaluate robustness to signal perturbations. We hope that our findings offer clear guidance for selecting input representations when developing the next generation of ELMs.
摘要
近年来,大型语言模型(LLMs)在心电图(ECG)解读中的应用日益增多,催生了心电图-语言模型(ELMs)。基于心电图和文本查询的条件,ELM能够自回归生成自由形式的文本响应。与传统基于分类的系统不同,ELMs通过发布诊断、分析波形形态、识别影响因素并提出针对患者的个性化行动计划,模拟了心脏电生理学专家的行为。为实现这一潜力,研究人员正在整理将心电图与文本对话配对的教学调优数据集,并基于这些资源训练ELMs。然而,在进一步扩展ELMs之前,一个尚未探索的基本问题是:最有效的心电图输入表示是什么?在最近的研究中,出现了三种候选表示形式——原始时间序列信号、渲染图像和离散化符号序列。我们首次对这些模态在6个公共数据集和5个评估指标上进行了全面基准测试。研究发现,符号表示在统计显著性上优于信号和图像输入的情况最多。我们进一步对LLM主干、ECG持续时间和令牌预算进行了消融实验,并评估了对信号扰动的鲁棒性。希望我们的研究结果为开发下一代ELMs时选择输入表示提供了明确的指导。
SQUiD: Synthesizing Relational Databases from Unstructured Text
Abstract
arXiv:2505.19025v1 Announce Type: new Abstract: Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets.
摘要
关系数据库是现代数据管理的核心,但大多数数据以非结构化形式(如文本文档)存在。为弥合这一鸿沟,我们利用大语言模型(LLM)从原始文本自动生成数据库模式并填充表格,从而合成关系数据库。本文提出新型神经符号框架SQUiD,将该任务分解为四个阶段,每个阶段采用专门技术。实验表明,SQUiD在多样化数据集上始终优于基线方法。
REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing
Abstract
arXiv:2505.18933v1 Announce Type: new Abstract: Large language model editing methods frequently suffer from overfitting, wherein factual updates can propagate beyond their intended scope, overemphasizing the edited target even when it's contextually inappropriate. To address this challenge, we introduce REACT (Representation Extraction And Controllable Tuning), a unified two-phase framework designed for precise and controllable knowledge editing. In the initial phase, we utilize tailored stimuli to extract latent factual representations and apply Principal Component Analysis with a simple learnable linear transformation to compute a directional "belief shift" vector for each instance. In the second phase, we apply controllable perturbations to hidden states using the obtained vector with a magnitude scalar, gated by a pre-trained classifier that permits edits only when contextually necessary. Relevant experiments on EVOKE benchmarks demonstrate that REACT significantly reduces overfitting across nearly all evaluation metrics, and experiments on COUNTERFACT and MQuAKE show that our method preserves balanced basic editing performance (reliability, locality, and generality) under diverse editing scenarios.
摘要
大语言模型编辑方法普遍存在过拟合问题,即事实更新可能超出预期范围,即使在不恰当的语境下也会过度强调编辑目标。为解决这一挑战,我们提出REACT(表征提取与可控调谐)框架,这是一个为精确可控知识编辑设计的统一两阶段方案。第一阶段通过定制化刺激提取潜在事实表征,并采用主成分分析与可学习的线性变换计算每个实例的方向性"信念偏移"向量。第二阶段利用所得向量及幅度标量对隐藏状态施加可控扰动,其门控机制由预训练分类器实现,仅在语境需要时允许编辑。在EVOKE基准测试中的实验表明,REACT在几乎所有评估指标上显著降低了过拟合现象;而在COUNTERFACT和MQuAKE上的实验证明,该方法在多样化编辑场景下能保持可靠度、局部性与泛化性等基础编辑性能的平衡。
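The second phase of REACT described above (a "belief shift" vector scaled by a magnitude scalar and gated by a classifier) can be illustrated with a minimal sketch. The function name, the 0.5 threshold, and the use of a classifier probability as the gate signal are assumptions made for illustration; the abstract does not specify these details.

```python
from typing import List


def gated_shift(
    hidden: List[float],
    shift: List[float],
    alpha: float,
    edit_probability: float,
    threshold: float = 0.5,  # hypothetical gate threshold, not from the paper
) -> List[float]:
    """Perturb a hidden state along a shift direction only when the gate opens.

    hidden: hidden-state vector (toy stand-in for a model activation)
    shift: directional "belief shift" vector of the same length
    alpha: magnitude scalar controlling the edit strength
    edit_probability: assumed classifier output in [0, 1]
    """
    if len(hidden) != len(shift):
        raise ValueError("hidden and shift must have the same length")
    if edit_probability < threshold:
        # Gate closed: the context does not call for the edit.
        return list(hidden)
    # Gate open: h' = h + alpha * v
    return [h + alpha * v for h, v in zip(hidden, shift)]
```

With the gate closed the hidden state passes through unchanged, which is the property the abstract relies on to keep edits from leaking into unrelated contexts.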
Can Large Language Models Infer Causal Relationships from Real-World Text?
Abstract
arXiv:2505.18931v1 Announce Type: new Abstract: Understanding and inferring causal relationships from texts is a core aspect of human cognition and is essential for advancing large language models (LLMs) towards artificial general intelligence. Existing work primarily focuses on synthetically generated texts which involve simple causal relationships explicitly mentioned in the text. This fails to reflect the complexities of real-world tasks. In this paper, we investigate whether LLMs are capable of inferring causal relationships from real-world texts. We develop a benchmark drawn from real-world academic literature which includes diverse texts with respect to length, complexity of relationships (different levels of explicitness, number of events, and causal relationships), and domains and sub-domains. To the best of our knowledge, our benchmark is the first-ever real-world dataset for this task. Our experiments on state-of-the-art LLMs evaluated on our proposed benchmark demonstrate significant challenges, with the best-performing model achieving an average F1 score of only 0.477. Analysis reveals common pitfalls: difficulty with implicitly stated information, in distinguishing relevant causal factors from surrounding contextual details, and with connecting causally relevant information spread across lengthy textual passages. By systematically characterizing these deficiencies, our benchmark offers targeted insights for further research into advancing LLM causal reasoning.
摘要
从文本中理解和推断因果关系是人类认知的核心方面,也是推动大语言模型(LLMs)迈向通用人工智能的关键。现有研究主要集中于合成生成的文本,这些文本仅涉及文中明确提及的简单因果关系,未能反映现实任务的复杂性。本文探究LLMs能否从现实世界文本中推断因果关系。我们构建了一个源自真实学术文献的基准测试集,包含长度各异、关系复杂度不同(明确性程度、事件数量及因果关系的差异)以及跨领域和子领域的多样化文本。据我们所知,这是该任务首个真实世界数据集。基于该基准对前沿LLMs的实验表明存在重大挑战,表现最佳模型的平均F1分数仅为0.477。分析揭示了常见缺陷:难以处理隐含信息、无法区分相关因果因素与上下文细节、以及难以整合分散在长文本中的因果相关信息。通过系统化表征这些不足,我们的基准为推进LLM因果推理的后续研究提供了针对性启示。
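For readers unfamiliar with the metric behind the 0.477 figure, the sketch below computes F1 over sets of directed cause-effect pairs. Treating each causal relationship as an exact-match (cause, effect) tuple is an assumption for illustration; the abstract does not describe the benchmark's matching procedure.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (cause, effect); direction matters


def causal_f1(predicted: Set[Edge], gold: Set[Edge]) -> float:
    """F1 between predicted and gold causal edges using exact matching."""
    if not predicted and not gold:
        return 0.0  # convention chosen here for the empty case
    true_positives = len(predicted & gold)
    # F1 = 2TP / (|predicted| + |gold|), equivalent to the harmonic
    # mean of precision and recall.
    return 2 * true_positives / (len(predicted) + len(gold))
```

Because edges are directed, a prediction that reverses cause and effect counts as an error in this sketch.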
Meta-aware Learning in text-to-SQL Large Language Model
Abstract
arXiv:2505.18929v1 Announce Type: new Abstract: Advances in large language models (LLMs) give text-to-SQL systems an opportunity to overcome two main challenges in business applications: understanding complex domain information and complex database structures. In this paper, we propose a meta-aware learning framework to integrate domain knowledge, database schema, chain-of-thought reasoning processes, and metadata relationships to improve the SQL generation quality. The proposed framework includes four learning strategies: schema-based learning, Chain-of-Thought (CoT) learning, knowledge-enhanced learning, and key information tokenization. Through fine-tuning, this approach gives the LLM a comprehensive understanding of database structure and metadata information, improving its performance on SQL generation within business domains. Through two experimental studies, we have demonstrated the superiority of the proposed methods in execution accuracy, multi-task SQL generation capability, and reduction of catastrophic forgetting.
摘要
大型语言模型(LLM)的进步为文本到SQL任务提供了重要机遇,以克服商业应用中理解复杂领域信息和复杂数据库结构的主要挑战。本文提出一种元感知学习框架,通过整合领域知识、数据库模式、思维链推理过程及元数据关系来提升SQL生成质量。该框架包含四种学习策略:基于模式的学习、思维链(CoT)学习、知识增强学习和关键信息标记化。该方法通过微调使LLM全面理解数据库结构和元数据信息,从而提升其在商业领域内SQL生成的性能。通过两项实验研究,我们验证了所提方法在执行准确率、多任务SQL生成能力以及减少灾难性遗忘方面的优越性。
Aligning LLM with human travel choices: a persona-based embedding learning approach
Abstract
arXiv:2505.19003v1 Announce Type: new Abstract: The advent of large language models (LLMs) presents new opportunities for travel demand modeling. However, behavioral misalignment between LLMs and humans presents obstacles for the usage of LLMs, and existing alignment methods are frequently inefficient or impractical given the constraints of typical travel demand data. This paper introduces a novel framework for aligning LLMs with human travel choice behavior, tailored to the current travel demand data sources. Our framework uses a persona inference and loading process to condition LLMs with suitable prompts to enhance alignment. The inference step establishes a set of base personas from empirical data, and a learned persona loading function driven by behavioral embeddings guides the loading process. We validate our framework on the Swissmetro mode choice dataset, and the results show that our proposed approach significantly outperformed baseline choice models and LLM-based simulation models in predicting both aggregate mode choice shares and individual choice outcomes. Furthermore, we showcase that our framework can generate insights on population behavior through interpretable parameters. Overall, our research offers a more adaptable, interpretable, and resource-efficient pathway to robust LLM-based travel behavior simulation, paving the way to integrate LLMs into travel demand modeling practice in the future.
摘要
大型语言模型(LLMs)的出现为交通需求建模带来了新的机遇。然而,LLMs与人类行为之间的偏差阻碍了其应用,且现有对齐方法在典型交通需求数据限制下往往效率低下或难以实施。本文提出一种新颖的框架,旨在使LLMs与人类出行选择行为对齐,并适应当前交通数据源特点。该框架通过角色推断与加载流程,利用合适的提示词对LLMs进行条件约束以提升对齐效果:推断步骤从实证数据中建立基础角色集,而由行为嵌入驱动的学习型角色加载函数则指导加载过程。我们在Swissmetro出行方式选择数据集上验证了该框架,结果表明所提方法在预测总体方式选择份额和个体选择结果方面,显著优于基线选择模型和基于LLM的仿真模型。此外,我们证明该框架可通过可解释参数生成群体行为洞见。总体而言,本研究为基于LLM的稳健交通行为仿真提供了更具适应性、可解释性且资源高效的路径,为未来将LLMs整合至交通需求建模实践奠定了基础。
Weaver: Interweaving SQL and LLM for Table Reasoning
Abstract
arXiv:2505.18961v1 Announce Type: new Abstract: Querying tables with unstructured data is challenging due to the presence of text (or image), either embedded in the table or in external paragraphs, which traditional SQL struggles to process, especially for tasks requiring semantic reasoning. While Large Language Models (LLMs) excel at understanding context, they face limitations with long input sequences. Existing approaches that combine SQL and LLMs typically rely on rigid, predefined workflows, limiting their adaptability to complex queries. To address these issues, we introduce Weaver, a modular pipeline that dynamically integrates SQL and LLMs for table-based question answering (TableQA). Weaver generates a flexible, step-by-step plan that combines SQL for structured data retrieval with LLMs for semantic processing. By decomposing complex queries into manageable subtasks, Weaver improves accuracy and generalization. Our experiments show that Weaver consistently outperforms state-of-the-art methods across four TableQA datasets, reducing both API calls and error rates.
摘要
由于表格中存在嵌入文本(或图像)或外部段落中的非结构化数据,传统SQL难以处理此类查询任务,尤其是需要语义推理的场景。尽管大语言模型(LLMs)擅长上下文理解,但面对长输入序列时仍存在局限。现有结合SQL与LLMs的方法通常依赖僵化的预定义工作流程,难以适应复杂查询需求。为此,我们提出Weaver——一种模块化流水线,通过动态整合SQL与LLMs实现基于表格的问答(TableQA)。Weaver生成灵活的逐步执行计划,结合SQL的结构化数据检索与LLMs的语义处理能力。通过将复杂查询分解为可处理的子任务,该系统显著提升了准确性与泛化能力。实验表明,Weaver在四个TableQA数据集上持续优于现有最优方法,同时降低了API调用次数与错误率。
RECAST: Strengthening LLMs' Complex Instruction Following with Constraint-Verifiable Data
Abstract
arXiv:2505.19030v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly expected to tackle complex tasks, driven by their expanding applications and users' growing proficiency in crafting sophisticated prompts. However, as the number of explicitly stated requirements increases (particularly more than 10 constraints), LLMs often struggle to accurately follow such complex instructions. To address this challenge, we propose RECAST, a novel framework for synthesizing datasets where each example incorporates far more constraints than those in existing benchmarks. These constraints are extracted from real-world prompt-response pairs to ensure practical relevance. RECAST enables automatic verification of constraint satisfaction via rule-based validators for quantitative constraints and LLM-based validators for qualitative ones. Using this framework, we construct RECAST-30K, a large-scale, high-quality dataset comprising 30k instances spanning 15 constraint types. Experimental results demonstrate that models fine-tuned on RECAST-30K show substantial improvements in following complex instructions. Moreover, the verifiability provided by RECAST enables the design of reward functions for reinforcement learning, which further boosts model performance on complex and challenging tasks.
摘要
随着大语言模型(LLMs)应用范围的扩大以及用户编写复杂提示能力的提升,它们被越来越多地要求处理复杂任务。然而,当显式声明的需求数量增加(尤其是超过10个约束条件时),LLMs往往难以准确遵循此类复杂指令。为解决这一挑战,我们提出了RECAST框架,该框架通过合成数据集使每个样本包含远超现有基准的约束条件。这些约束从真实世界的提示-响应对中提取,以确保实际相关性。RECAST支持通过基于规则的验证器自动检验定量约束的满足情况,并利用基于LLM的验证器检验定性约束。借助该框架,我们构建了RECAST-30K数据集——一个包含15种约束类型、规模达3万实例的大规模高质量数据集。实验结果表明,基于RECAST-30K微调的模型在遵循复杂指令方面表现出显著提升。此外,RECAST提供的可验证性为强化学习的奖励函数设计提供了支持,从而进一步提高了模型在复杂挑战性任务上的性能。
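The rule-based side of RECAST's verification (validators for quantitative constraints) can be pictured with a small sketch. The three constraint types, the function names, and the idea of reporting the satisfied fraction are illustrative assumptions; the abstract does not list the 15 constraint types or define a reward formula.

```python
import re
from typing import Callable, Dict

Validator = Callable[[str], bool]


def max_words(limit: int) -> Validator:
    """Constraint: the response has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def exact_bullets(count: int) -> Validator:
    """Constraint: the response has exactly `count` lines starting with '- ' or '* '."""
    pattern = re.compile(r"^[ \t]*[-*] ", flags=re.MULTILINE)
    return lambda text: len(pattern.findall(text)) == count


def must_include(keyword: str) -> Validator:
    """Constraint: the response mentions `keyword` (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()


def verify(response: str, validators: Dict[str, Validator]) -> Dict[str, bool]:
    """Run every validator and report which constraints are satisfied."""
    return {name: check(response) for name, check in validators.items()}


def satisfaction_rate(results: Dict[str, bool]) -> float:
    """Fraction of satisfied constraints; one possible scalar signal."""
    return sum(results.values()) / len(results) if results else 0.0
```

A scalar such as `satisfaction_rate` is one way verifiable constraints could feed a reinforcement learning reward, which is the use the abstract mentions; qualitative constraints would still need an LLM-based validator, which this sketch leaves out.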
Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models
Abstract
arXiv:2505.18955v1 Announce Type: new Abstract: Motivated by the success of general-purpose large language models (LLMs) in software patching, recent works started to train specialized patching models. Most works trained one model to handle the end-to-end patching pipeline (including issue localization, patch generation, and patch validation). However, it is hard for a small model to handle all tasks, as different sub-tasks have different workflows and require different expertise. As a result, even with a 70-billion-parameter model, SOTA methods reach at most a 41% resolved rate on SWE-bench-Verified. Motivated by the collaborative nature, we propose Co-PatcheR, the first collaborative patching system with small and specialized reasoning models for individual components. Our key technical novelties are the specific task designs and training recipes. First, we train a model for localization and patch generation. Our localization pinpoints the suspicious lines through a two-step procedure, and our generation combines patch generation and critique. We then propose a hybrid patch validation that includes two models for crafting issue-reproducing test cases with and without assertions and judging patch correctness, followed by a majority vote-based patch selection. Through extensive evaluation, we show that Co-PatcheR achieves a 46% resolved rate on SWE-bench-Verified with only 3 × 14B models. This makes Co-PatcheR the best patcher with specialized models, requiring the least training resources and the smallest models. We conduct a comprehensive ablation study to validate our recipes, as well as our choice of training data size, model size, and testing-phase scaling strategy.
摘要
受通用大语言模型(LLM)在软件补丁生成领域成功的启发,近期研究开始训练专用补丁生成模型。现有工作大多训练单一模型处理端到端补丁流程(包括问题定位、补丁生成和补丁验证)。然而,小型模型难以胜任所有子任务,因为不同子任务具有差异化的工作流程和专业知识需求。因此,当前最佳方法使用700亿参数模型时,在SWE-bench-Verified基准上仅能达到41%的修复率。基于协作机制的思想,我们提出首个协作式补丁系统Co-PatcheR,该系统采用小型专用推理模型分别处理各组件任务。我们的核心技术创新在于特定任务设计与训练方案:首先训练定位与补丁生成联合模型,其中定位模块通过两级流程精确定位可疑代码行,生成模块整合补丁生成与批判式改进;随后提出混合补丁验证机制,包含两个模型分别用于生成带断言/不带断言的问题复现测试用例、判断补丁正确性,最终基于多数表决机制选择补丁。大量实验表明,Co-PatcheR仅使用3个140亿参数模型即在SWE-bench-Verified上实现46%的修复率,成为专用模型中性能最佳、训练资源需求最低且模型尺寸最小的补丁系统。我们通过全面消融实验验证了训练方案的有效性,以及对训练数据量、模型规模和测试阶段扩展策略的选择依据。
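The final step of Co-PatcheR's validation, majority vote-based patch selection, can be sketched as follows. The voting scheme shown here (each validation signal contributes a pass/fail verdict per candidate, and the candidate with the most passes wins) is an assumption; the abstract does not specify how votes are counted or how ties are broken.

```python
from typing import Dict, List


def select_patch(verdicts: Dict[str, List[bool]]) -> str:
    """Pick the candidate patch endorsed by the most validation verdicts.

    verdicts maps a hypothetical patch id to pass/fail verdicts, e.g. one
    per generated reproduction test or correctness judgment.
    Ties go to the candidate that was inserted first.
    """
    if not verdicts:
        raise ValueError("no candidate patches to choose from")
    # max() returns the first maximal key in dict insertion order.
    return max(verdicts, key=lambda patch_id: sum(verdicts[patch_id]))
```

For example, given verdicts `{"patch_a": [True, False, False], "patch_b": [True, True, False]}`, the function returns `"patch_b"`.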
OrgAccess: A Benchmark for Role Based Access Control in Organization Scale LLMs
Abstract
arXiv:2505.19165v1 Announce Type: new Abstract: Role-based access control (RBAC) and hierarchical structures are foundational to how information flows and decisions are made within virtually all organizations. As the potential of Large Language Models (LLMs) to serve as unified knowledge repositories and intelligent assistants in enterprise settings becomes increasingly apparent, a critical, yet underexplored, challenge emerges: can these models reliably understand and operate within the complex, often nuanced, constraints imposed by organizational hierarchies and associated permissions? Evaluating this crucial capability is inherently difficult due to the proprietary and sensitive nature of real-world corporate data and access control policies. We introduce a synthetic yet representative OrgAccess benchmark consisting of 40 distinct types of permissions commonly relevant across different organizational roles and levels. We further create three types of permissions: 40,000 easy (1 permission), 10,000 medium (3-permissions tuple), and 20,000 hard (5-permissions tuple) to test LLMs' ability to accurately assess these permissions and generate responses that strictly adhere to the specified hierarchical rules, particularly in scenarios involving users with overlapping or conflicting permissions. Our findings reveal that even state-of-the-art LLMs struggle significantly to maintain compliance with role-based structures, even with explicit instructions, and their performance degrades further when navigating interactions involving two or more conflicting permissions. Specifically, even GPT-4.1 only achieves an F1-Score of 0.27 on our hardest benchmark. This demonstrates a critical limitation in LLMs' complex rule following and compositional reasoning capabilities beyond standard factual or STEM-based benchmarks, opening up a new paradigm for evaluating their fitness for practical, structured environments.
SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning
Abstract
arXiv:2505.19099v1 Announce Type: new Abstract: We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
摘要
我们推出SeePhys——一个基于从中学到博士资格考试物理问题的大规模多模态基准测试,用于评估大语言模型的物理推理能力。该基准涵盖物理学7个基础领域,包含21类高度异质化的图表。与先前研究中视觉元素主要起辅助作用不同,我们的基准测试中视觉关键问题占比高达75%,这类问题必须通过视觉信息提取才能获得正确答案。通过广泛评估发现,即使最先进的视觉推理模型(如Gemini-2.5-pro和o4-mini)在本基准上的准确率也不足60%。这些结果揭示了当前大语言模型在视觉理解能力上存在根本性挑战,主要体现在:(i) 难以建立图表解析与物理推理之间的严格耦合关系;(ii) 无法克服对文本线索作为认知捷径的持续依赖。
Reinforced Latent Reasoning for LLM-based Recommendation
Abstract
arXiv:2505.19092v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks, sparking growing interest in their application to preference reasoning in recommendation systems. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. However, these methods face significant practical limitations due to (1) the difficulty of obtaining high-quality CoT data in recommendation and (2) the high inference latency caused by generating CoT reasoning. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning. This approach eliminates the need for explicit CoT generation and improves inference efficiency, as a small set of latent tokens can effectively capture the entire reasoning process. Building on this idea, we propose Reinforced Latent Reasoning for Recommendation (LatentR³), a novel end-to-end training framework that leverages reinforcement learning (RL) to optimize latent reasoning without relying on any CoT data. LatentR³ adopts a two-stage training strategy: first, supervised fine-tuning to initialize the latent reasoning module, followed by pure RL training to encourage exploration through a rule-based reward design. Our RL implementation is based on a modified GRPO algorithm, which reduces computational overhead during training and introduces continuous reward signals for more efficient learning. Extensive experiments demonstrate that LatentR³ enables effective latent reasoning without any direct supervision of the reasoning process, significantly improving performance when integrated with different LLM-based recommendation methods. Our codes are available at https://anonymous.4open.science/r/R3-A278/.
摘要
大型语言模型(LLMs)在复杂问题解决任务中展现出卓越的推理能力,这激发了人们对其在推荐系统中偏好推理应用的日益关注。现有方法通常依赖于显式思维链(CoT)数据的微调,但这些方法面临两大实际限制:(1)难以获取高质量的推荐领域CoT数据;(2) 生成CoT推理导致的高推理延迟。本研究探索了一种替代方案,将显式CoT推理转向紧凑、信息密集的潜在推理。该方法无需生成显式CoT,并通过少量潜在令牌即可完整捕获推理过程,从而提升推理效率。基于此,我们提出《推荐系统中的强化潜在推理》(LatentR³)——一种端到端训练框架,利用强化学习(RL)优化潜在推理且不依赖任何CoT数据。LatentR³采用两阶段训练策略:首先通过监督微调初始化潜在推理模块,再通过基于规则的奖励设计进行纯RL训练以促进探索。我们的RL实现基于改进的GRPO算法,可降低训练计算开销并提供连续奖励信号以实现高效学习。大量实验表明,LatentR³能在无任何推理过程直接监督的情况下实现有效潜在推理,当与不同基于LLM的推荐方法结合时显著提升性能。代码发布于https://anonymous.4open.science/r/R3-A278/。
ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World
Abstract
arXiv:2505.19095v1 Announce Type: new Abstract: The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or vision-language models (VLMs) often fail to generalize to novel environments and rely heavily on manually curated, diverse datasets. To overcome these limitations, we introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments. We introduce a world-model-based curiosity reward function to help the agent overcome the cold-start phase of exploration. Additionally, distilling experience streams further enhances the model's exploration capabilities. Our training framework enhances model exploration in open GUI environments, with trained models showing better environmental adaptation and sustained exploration compared to static deployment models. Our findings offer a scalable pathway toward AGI systems with self-improving capabilities in complex interactive settings.
摘要
大语言模型(LLM)的快速发展引发了在图形用户界面(GUI)环境中构建人工通用智能(AGI)的日益增长的兴趣。然而,现有基于LLM或视觉语言模型(VLM)的GUI智能体往往难以泛化至新环境,且严重依赖人工整理的多样化数据集。为克服这些局限,我们提出了ScreenExplorer——一种通过群体相对策略优化(GRPO)在真实、动态且开放式的GUI环境中训练的VLM。创新性地,我们引入了基于世界模型的好奇心奖励函数,以帮助智能体克服探索的冷启动阶段。此外,经验流的蒸馏进一步增强了模型的探索能力。我们的训练框架提升了模型在开放GUI环境中的探索性能,与静态部署模型相比,经过训练的模型展现出更好的环境适应性与持续探索能力。本研究为复杂交互场景中具有自我提升能力的AGI系统提供了一条可扩展的发展路径。
Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
Abstract
arXiv:2505.19075v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise their generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, here we propose Universal Reasoner (UniR) - a single, lightweight, composable, and plug-and-play reasoning module that can be used with any frozen LLM to endow it with specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module that is trained independently using predefined rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR can be combined with any frozen LLM at inference time by simply adding its output logits to those of the LLM backbone. This additive structure naturally enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Experimental results on mathematical reasoning and machine translation tasks show that UniR significantly outperforms existing baseline fine-tuning methods using the Llama3.2 model. Furthermore, UniR demonstrates strong weak-to-strong generalization: reasoning modules trained on smaller models effectively guide much larger LLMs. This makes UniR a cost-efficient, adaptable, and robust solution for enhancing reasoning in LLMs without compromising their core capabilities. Code is open-sourced at https://github.com/hangeol/UniR
摘要
大语言模型(LLMs)已展现出卓越的通用能力,但提升推理等专项技能通常需要大量计算资源,并可能削弱其泛化性能。虽然参数高效微调(PEFT)方法提供了更节约资源的替代方案,但由于架构依赖性,这些方法通常需要针对每个LLM主干进行重新训练。为解决这些问题,本文提出通用推理器(UniR)——一个轻量级、可组合、即插即用的独立推理模块,可与任何冻结的LLM结合使用以赋予其专业推理能力。具体而言,UniR将奖励函数解耦为独立的推理模块,通过预定义奖励进行独立训练,从而将轨迹级信号有效转化为词元级指导。训练完成后,UniR只需将其输出逻辑值与LLM主干的逻辑值相加,即可在推理阶段与任意冻结的LLM结合使用。这种可加性结构天然支持模块化组合:针对不同任务训练的多个UniR模块可通过逻辑值求和实现联合应用,从而通过组合完成复杂推理。在数学推理和机器翻译任务上的实验表明,UniR显著优于使用Llama3.2模型的现有基线微调方法。此外,UniR展现出强大的弱到强泛化能力:基于较小模型训练的推理模块能有效指导规模更大的LLMs。这使得UniR成为一种在不损害核心能力的前提下,提升LLM推理能力的经济高效、适应性强且稳健的解决方案。
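The additive combination UniR describes (adding the reasoning module's output logits to those of the frozen backbone, and summing several modules for composition) can be shown on toy next-token logits. This is a minimal sketch, not the released implementation: the function names are hypothetical, and it assumes the backbone and every module share one vocabulary so that logits align by index.

```python
import math
from typing import List, Sequence


def combine_logits(backbone: Sequence[float], *modules: Sequence[float]) -> List[float]:
    """Sum frozen-backbone logits with the logits of any number of modules."""
    combined = list(backbone)
    for module_logits in modules:
        if len(module_logits) != len(combined):
            raise ValueError("all logit vectors must cover the same vocabulary")
        combined = [c + m for c, m in zip(combined, module_logits)]
    return combined


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(backbone: Sequence[float], *modules: Sequence[float]) -> int:
    """Greedy choice of the next token id from the combined distribution."""
    probs = softmax(combine_logits(backbone, *modules))
    return max(range(len(probs)), key=probs.__getitem__)
```

In this toy setting a backbone that prefers token 0 can be steered to token 1 by a module that adds enough logit mass there, while the backbone's own parameters stay untouched.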
CardioCoT: Hierarchical Reasoning for Multimodal Survival Analysis
Abstract
arXiv:2505.19195v1 Announce Type: new Abstract: Accurate prediction of major adverse cardiovascular events recurrence risk in acute myocardial infarction patients based on postoperative cardiac MRI and associated clinical notes is crucial for precision treatment and personalized intervention. Existing methods primarily focus on risk stratification capability while overlooking the need for intermediate robust reasoning and model interpretability in clinical practice. Moreover, end-to-end risk prediction using LLM/VLM faces significant challenges due to data limitations and modeling complexity. To bridge this gap, we propose CardioCoT, a novel two-stage hierarchical reasoning-enhanced survival analysis framework designed to enhance both model interpretability and predictive performance. In the first stage, we employ an evidence-augmented self-refinement mechanism to guide LLM/VLMs in generating robust hierarchical reasoning trajectories based on associated radiological findings. In the second stage, we integrate the reasoning trajectories with imaging data for risk model training and prediction. CardioCoT demonstrates superior performance in MACE recurrence risk prediction while providing interpretable reasoning processes, offering valuable insights for clinical decision-making.
摘要
基于术后心脏磁共振成像及相关临床记录,准确预测急性心肌梗死患者主要不良心血管事件复发风险对于精准治疗和个性化干预至关重要。现有方法主要关注风险分层能力,而忽视了临床实践中对中间稳健推理和模型可解释性的需求。此外,由于数据限制和建模复杂性,使用LLM/VLM进行端到端风险预测面临重大挑战。为弥补这一空白,我们提出CardioCoT——一个新颖的两阶段分层推理增强生存分析框架,旨在同时提升模型可解释性和预测性能。第一阶段采用证据增强的自优化机制,引导LLM/VLM基于相关放射学发现生成稳健的分层推理轨迹;第二阶段将推理轨迹与影像数据整合进行风险模型训练与预测。CardioCoT在MACE复发风险预测中展现出卓越性能,同时提供可解释的推理过程,为临床决策提供宝贵见解。
Structuring the Unstructured: A Multi-Agent System for Extracting and Querying Financial KPIs and Guidance
Abstract
arXiv:2505.19197v1 Announce Type: new Abstract: Extracting structured and quantitative insights from unstructured financial filings is essential in investment research, yet remains time-consuming and resource-intensive. Conventional approaches in practice rely heavily on labor-intensive manual processes, limiting scalability and delaying the research workflow. In this paper, we propose an efficient and scalable method for accurately extracting quantitative insights from unstructured financial documents, leveraging a multi-agent system composed of large language models. Our proposed multi-agent system consists of two specialized agents: the Extraction Agent and the Text-to-SQL Agent. The Extraction Agent automatically identifies key performance indicators from unstructured financial text, standardizes their formats, and verifies their accuracy. On the other hand, the Text-to-SQL Agent generates executable SQL statements from natural language queries, allowing users to access structured data accurately without requiring familiarity with the database schema. Through experiments, we demonstrate that our proposed system effectively transforms unstructured text into structured data accurately and enables precise retrieval of key information. First, we demonstrate that our system achieves approximately 95% accuracy in transforming financial filings into structured data, matching the performance level typically attained by human annotators. Second, in a human evaluation of the retrieval task -- where natural language queries are used to search information from structured data -- 91% of the responses were rated as correct by human evaluators. In both evaluations, our system generalizes well across financial document types, consistently delivering reliable performance.
摘要
从非结构化财务文件中提取结构化定量洞察对投资研究至关重要,但这一过程仍耗时且资源密集。传统实践方法严重依赖劳动密集型人工处理,限制了可扩展性并延缓研究流程。本文提出一种高效可扩展的方法,通过基于大语言模型的多智能体系统,从非结构化财务文档中准确提取定量信息。我们设计的多智能体系统包含两个专用代理:提取代理和文本转SQL代理。提取代理能自动识别非结构化财务文本中的关键绩效指标,标准化其格式并验证准确性;而文本转SQL代理可将自然语言查询转换为可执行SQL语句,使用户无需了解数据库模式即可准确访问结构化数据。实验表明,本系统能有效将非结构化文本准确转化为结构化数据并实现关键信息的精准检索。首先,系统在财务文件结构化转换中达到约95%的准确率,与人工标注水平相当;其次,在基于自然语言查询的结构化数据检索任务的人为评估中,91%的响应被评估者判定为正确。两项评估均显示,本系统对不同类型财务文档具有良好的泛化能力,能持续提供可靠性能。
Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval Augmented Generation Across Learning Style
Abstract
arXiv:2505.19173v1 Announce Type: new Abstract: Effective teaching requires adapting instructional strategies to accommodate the diverse cognitive and behavioral profiles of students, a persistent challenge in education and teacher training. While Large Language Models (LLMs) offer promise as tools to simulate such complex pedagogical environments, current simulation frameworks are limited in two key respects: (1) they often reduce students to static knowledge profiles, and (2) they lack adaptive mechanisms for modeling teachers who evolve their strategies in response to student feedback. To address these gaps, we introduce a novel simulation framework that integrates LLM-based heterogeneous student agents with a self-optimizing teacher agent. The teacher agent's pedagogical policy is dynamically evolved using a genetic algorithm, allowing it to discover and refine effective teaching strategies based on the aggregate performance of diverse learners. In addition, we propose Persona-RAG, a Retrieval Augmented Generation module that enables student agents to retrieve knowledge tailored to their individual learning styles. Persona-RAG preserves the retrieval accuracy of standard RAG baselines while enhancing personalization, an essential factor in modeling realistic educational scenarios. Through extensive experiments, we demonstrate how our framework supports the emergence of distinct and interpretable teaching patterns when interacting with varied student populations. Our results highlight the potential of LLM-driven simulations to inform adaptive teaching practices and provide a testbed for training human educators in controlled, data-driven environments.
摘要
有效教学需要调整教学策略以适应学生多样化的认知和行为特征,这是教育及教师培训中长期存在的挑战。虽然大语言模型(LLMs)作为模拟此类复杂教学环境的工具展现出潜力,但现有仿真框架存在两个关键局限:(1)通常将学生简化为静态知识图谱;(2)缺乏建模教师根据学生反馈动态调整策略的适应机制。为弥补这些不足,我们提出了一种新型仿真框架,该框架整合了基于LLM的异构学生智能体与自优化教师智能体。教师智能体的教学策略通过遗传算法动态进化,使其能根据多样化学习者的整体表现发现并优化教学策略。此外,我们提出Persona-RAG——一个检索增强生成模块,使学生智能体能获取符合其个性化学习风格的知识。该模块在保持标准RAG基线检索准确性的同时增强了个性化程度,这是构建真实教育场景模型的关键要素。通过大量实验,我们展示了该框架如何在与不同学生群体互动时形成独特且可解释的教学模式。研究结果凸显了LLM驱动仿真在指导适应性教学实践方面的潜力,并为在受控的数据驱动环境中培训人类教师提供了实验平台。
GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling
Abstract
arXiv:2505.19234v1 Announce Type: new Abstract: The emergence of large language models (LLMs) enables the development of intelligent agents capable of engaging in complex and multi-turn dialogues. However, multi-agent collaboration faces critical safety challenges, such as hallucination amplification and error injection and propagation. This paper presents GUARDIAN, a unified method for detecting and mitigating multiple safety concerns in GUARDing Intelligent Agent collaboratioNs. By modeling the multi-agent collaboration process as a discrete-time temporal attributed graph, GUARDIAN explicitly captures the propagation dynamics of hallucinations and errors. Its unsupervised encoder-decoder architecture, which incorporates an incremental training paradigm, learns to reconstruct node attributes and graph structures from latent embeddings, enabling the identification of anomalous nodes and edges with high precision. Moreover, we introduce a graph abstraction mechanism based on the Information Bottleneck Theory, which compresses temporal interaction graphs while preserving essential patterns. Extensive experiments demonstrate GUARDIAN's effectiveness in safeguarding LLM multi-agent collaborations against diverse safety vulnerabilities, achieving state-of-the-art accuracy with efficient resource utilization.
摘要
大型语言模型(LLMs)的出现使得能够开发出参与复杂多轮对话的智能体。然而,多智能体协作面临关键的安全挑战,如幻觉放大以及错误注入与传播。本文提出GUARDIAN,一种用于检测和缓解智能体协作中多种安全问题的统一方法,通过将多智能体协作过程建模为离散时间时序属性图,GUARDIAN显式捕获幻觉和错误的传播动态。采用无监督编码器-解码器架构并结合增量训练范式,该方法学习从潜在嵌入中重构节点属性和图结构,从而以极高精度识别异常节点和边。此外,我们引入基于信息瓶颈理论的图抽象机制,在压缩时序交互图的同时保留关键模式。大量实验证明,GUARDIAN在保护LLM多智能体协作抵御各类安全漏洞方面具有显著效果,以高效资源利用率实现了最先进的准确率。
Sensorimotor features of self-awareness in multimodal large language models
Abstract
arXiv:2505.19237v1 Announce Type: new Abstract: Self-awareness - the ability to distinguish oneself from the surrounding environment - underpins intelligent, autonomous behavior. Recent advances in AI achieve human-like performance in tasks integrating multimodal information, particularly in large language models, raising interest in the embodiment capabilities of AI agents on nonhuman platforms such as robots. Here, we explore whether multimodal LLMs can develop self-awareness solely through sensorimotor experiences. By integrating a multimodal LLM into an autonomous mobile robot, we test its ability to achieve this capacity. We find that the system exhibits robust environmental awareness, self-recognition and predictive awareness, allowing it to infer its robotic nature and motion characteristics. Structural equation modeling reveals how sensory integration influences distinct dimensions of self-awareness and its coordination with past-present memory, as well as the hierarchical internal associations that drive self-identification. Ablation tests of sensory inputs identify critical modalities for each dimension, demonstrate compensatory interactions among sensors and confirm the essential role of structured and episodic memory in coherent reasoning. These findings demonstrate that, given appropriate sensory information about the world and itself, multimodal LLMs exhibit emergent self-awareness, opening the door to artificial embodied cognitive systems.
摘要
自我意识——即区分自身与周围环境的能力——是智能自主行为的基石。近期人工智能在多模态信息整合任务中(尤其是大语言模型)实现了类人性能,引发了人们对机器人等非人类平台上AI智能体具身能力的兴趣。本研究探讨多模态大语言模型能否仅通过感觉运动经验发展自我意识。通过将多模态大语言模型集成至自主移动机器人,我们测试了其实现该能力的可能性。研究发现该系统展现出强大的环境感知、自我识别和预测性意识,使其能够推断自身的机器人属性和运动特征。结构方程模型揭示了感觉整合如何影响自我意识的不同维度及其与过去-现在记忆的协调,以及驱动自我识别的层级化内部关联。感官输入的消融实验确定了各维度的关键模态,证明了传感器间的补偿性相互作用,并验证了结构化记忆与情景记忆在连贯推理中的核心作用。这些发现表明,当获得关于世界和自身的适当感官信息时,多模态大语言模型会呈现出涌现的自我意识,为人工具身认知系统的发展开启了新途径。
ODIN: A NL2SQL Recommender to Handle Schema Ambiguity
Abstract
arXiv:2505.19302v1 Announce Type: new Abstract: NL2SQL (natural language to SQL) systems translate natural language into SQL queries, allowing users with no technical background to interact with databases and create tools like reports or visualizations. While recent advancements in large language models (LLMs) have significantly improved NL2SQL accuracy, schema ambiguity remains a major challenge in enterprise environments with complex schemas, where multiple tables and columns with semantically similar names often co-exist. To address schema ambiguity, we introduce ODIN, a NL2SQL recommendation engine. Instead of producing a single SQL query given a natural language question, ODIN generates a set of potential SQL queries by accounting for different interpretations of ambiguous schema components. ODIN dynamically adjusts the number of suggestions based on the level of ambiguity, and ODIN learns from user feedback to personalize future SQL query recommendations. Our evaluation shows that ODIN improves the likelihood of generating the correct SQL query by 1.5-2× compared to baselines.
摘要
NL2SQL(自然语言转SQL)系统将自然语言转换为SQL查询,使非技术背景用户能够与数据库交互并创建报表或可视化等工具。尽管大语言模型(LLM)的最新进展显著提升了NL2SQL的准确率,但在具有复杂模式的企业环境中,模式歧义仍是主要挑战——这些环境中常存在多个表及语义相似的列名共存的情况。为解决模式歧义问题,我们提出了ODIN,一个NL2SQL推荐引擎。ODIN不针对自然语言问题生成单一SQL查询,而是通过考虑歧义模式组件的不同解释,生成一组潜在SQL查询。ODIN根据歧义程度动态调整建议数量,并通过学习用户反馈来个性化未来的SQL查询推荐。评估表明,与基线相比,ODIN将生成正确SQL查询的概率提高了1.5-2倍。
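To make the idea of recommending several queries rather than one more concrete, here is a minimal sketch. It is not ODIN's implementation: the `enumerate_queries` helper, the SQL template, and the column names are all hypothetical, and the sketch only shows how one candidate query per combination of schema interpretations could be produced, so that more ambiguity yields more suggestions.

```python
from itertools import product


def enumerate_queries(template, candidates):
    """Return one SQL string per combination of schema interpretations.

    template   -- SQL text with {placeholders} for ambiguous components (hypothetical)
    candidates -- dict mapping each placeholder to plausible schema names (hypothetical)
    """
    keys = sorted(candidates)
    queries = []
    # Cartesian product over the interpretations of every ambiguous component.
    for combo in product(*(candidates[k] for k in keys)):
        queries.append(template.format(**dict(zip(keys, combo))))
    return queries


# Illustrative schema: "revenue" could refer to two similarly named columns.
template = "SELECT SUM({revenue}) FROM {table}"
candidates = {
    "revenue": ["net_revenue", "gross_revenue"],
    "table": ["sales_2024"],
}
queries = enumerate_queries(template, candidates)
```

A real recommender would also rank and prune these candidates and update the ranking from user feedback; the sketch leaves that out.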
Evaluating Steering Techniques using Human Similarity Judgments
Abstract
arXiv:2505.19333v1 Announce Type: new Abstract: Current evaluations of Large Language Model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods both in terms of steering accuracy and model-to-human alignment. We also found LLMs were biased towards 'kind' similarity and struggled with 'size' alignment. This evaluation approach, grounded in human cognition, adds further support to the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.
摘要
当前对大型语言模型(LLM)引导技术的评估主要关注任务特定性能,而忽视了被引导的表征与人类认知的契合程度。本研究通过成熟的三元相似性判断任务,评估了受引导LLM在基于'尺寸'或'类别'灵活判断概念相似性的能力。研究发现,基于提示的引导方法在引导准确性和模型-人类对齐度方面均优于其他方法。同时发现LLM存在偏向'类别'相似性的偏见,且在'尺寸'维度上难以实现对齐。这种基于人类认知的评估方法,不仅进一步验证了基于提示的引导技术有效性,还揭示了LLM在未经引导前就存在的表征轴偏好。
Using Large Language Models to Assess Teachers' Pedagogical Content Knowledge
Abstract
arXiv:2505.19266v1 Announce Type: new Abstract: Assessing teachers' pedagogical content knowledge (PCK) through performance-based tasks is both time- and effort-consuming. While large language models (LLMs) offer new opportunities for efficient automatic scoring, little is known about whether LLMs introduce construct-irrelevant variance (CIV) in ways similar to or different from traditional machine learning (ML) and human raters. This study examines three sources of CIV -- scenario variability, rater severity, and rater sensitivity to scenario -- in the context of video-based constructed-response tasks targeting two PCK sub-constructs: analyzing student thinking and evaluating teacher responsiveness. Using generalized linear mixed models (GLMMs), we compared variance components and rater-level scoring patterns across three scoring sources: human raters, supervised ML, and LLM. Results indicate that scenario-level variance was minimal across tasks, while rater-related factors contributed substantially to CIV, especially in the more interpretive Task II. The ML model was the most severe and least sensitive rater, whereas the LLM was the most lenient. These findings suggest that the LLM contributes to scoring efficiency while also introducing CIV as human raters do, yet with varying levels of contribution compared to supervised ML. Implications for rater training, automated scoring design, and future research on model interpretability are discussed.
摘要
评估教师的学科教学知识(PCK)通过基于表现的任务既耗时又费力。尽管大语言模型(LLM)为高效自动评分提供了新机遇,但目前尚不清楚LLM是否会在与传统机器学习(ML)和人类评分者相似或不同的方式下引入构念无关变异(CIV)。本研究在基于视频的建构反应任务背景下,考察了三种CIV来源——情境变异性、评分者严厉度以及评分者对情境的敏感性——这些任务针对PCK的两个子构念:分析学生思维和评估教师回应能力。通过广义线性混合模型(GLMM),我们比较了三种评分来源(人类评分者、监督式ML和LLM)的方差成分和评分者层面的评分模式。结果显示,跨任务的情境水平方差极小,而评分者相关因素对CIV贡献显著,尤其在更具解释性的任务II中。ML模型是最严厉且敏感性最低的评分者,而LLM则最为宽松。这些发现表明,LLM在提升评分效率的同时,也像人类评分者一样引入了CIV,但其贡献程度与监督式ML有所不同。研究还讨论了对评分者培训、自动评分设计以及未来模型可解释性研究的启示。
Style2Code: A Style-Controllable Code Generation Framework with Dual-Modal Contrastive Representation Learning
Abstract
arXiv:2505.19442v1 Announce Type: new Abstract: Controllable code generation, the ability to synthesize code that follows a specified style while maintaining functionality, remains a challenging task. We propose a two-stage training framework combining contrastive learning and conditional decoding to enable flexible style control. The first stage aligns code style representations with semantic and structural features. In the second stage, we fine-tune a language model (e.g., Flan-T5) conditioned on the learned style vector to guide generation. Our method supports style interpolation and user personalization via lightweight mixing. Compared to prior work, our unified framework offers improved stylistic control without sacrificing code correctness. This is among the first approaches to combine contrastive alignment with conditional decoding for style-guided code generation.
摘要
可控代码生成是指在保持功能性的同时合成符合特定风格代码的能力,这仍是一项具有挑战性的任务。我们提出了一种结合对比学习和条件解码的两阶段训练框架,以实现灵活的风格控制。第一阶段将代码风格表示与语义和结构特征对齐。第二阶段,我们基于学习到的风格向量对语言模型(如Flan-T5)进行微调以指导生成。我们的方法通过轻量级混合支持风格插值和用户个性化定制。与现有工作相比,该统一框架在不牺牲代码正确性的前提下提供了更好的风格控制能力。这是首个将对比对齐与条件解码相结合来实现风格引导代码生成的方法之一。
Architectures of Error: A Philosophical Inquiry into AI and Human Code Generation
Abstract
arXiv:2505.19353v1 Announce Type: new Abstract: With the rise of generative AI (GenAI), Large Language Models are increasingly employed for code generation, becoming active co-authors alongside human programmers. Focusing specifically on this application domain, this paper articulates distinct "Architectures of Error" to ground an epistemic distinction between human and machine code generation. Examined through their shared vulnerability to error, this distinction reveals fundamentally different causal origins: human-cognitive versus artificial-stochastic. To develop this framework and substantiate the distinction, the analysis draws critically upon Dennett's mechanistic functionalism and Rescher's methodological pragmatism. I argue that a systematic differentiation of these error profiles raises critical philosophical questions concerning semantic coherence, security robustness, epistemic limits, and control mechanisms in human-AI collaborative software development. The paper also utilizes Floridi's levels of abstraction to provide a nuanced understanding of how these error dimensions interact and may evolve with technological advancements. This analysis aims to offer philosophers a structured framework for understanding GenAI's unique epistemological challenges, shaped by these architectural foundations, while also providing software engineers a basis for more critically informed engagement.
摘要
随着生成式人工智能(GenAI)的兴起,大语言模型日益被用于代码生成,成为与人类程序员并肩的活跃合著者。本文聚焦这一特定应用领域,提出独特的"错误架构"理论框架,以确立人类与机器在代码生成层面的认知差异。通过分析二者共有的错误脆弱性,这种差异揭示了根本不同的因果起源:人类认知型错误与人工随机型错误。为构建该框架并验证其区分效度,本研究批判性借鉴了丹尼特的机械功能主义与雷谢尔的方法论实用主义。笔者认为,系统区分这两类错误模式将引发关于人机协作软件开发中语义连贯性、安全鲁棒性、认知边界及控制机制等关键哲学问题。本文还运用弗洛里迪的抽象层级理论,对这些错误维度的交互作用及其可能随技术发展的演变路径进行了精细化阐释。该分析旨在为哲学家提供理解生成式人工智能独特认识论挑战的结构化框架(这些挑战由上述架构基础所塑造),同时为软件工程师开展更具批判性的实践提供理论基础。
CaseEdit: Enhancing Localized Commonsense Reasoning via Null-Space Constrained Knowledge Editing in Small Parameter Language Models
Abstract
arXiv:2505.19383v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong performance on factual recall and general reasoning but struggle to adapt to user-specific, commonsense knowledge, a challenge particularly acute in small-parameter settings where computational efficiency is prioritized. We introduce CaseEdit, a new dataset and generation pipeline for evaluating localized, personalized commonsense knowledge editing in small LLMs to address this. Built upon the ATOMIC20/20 commonsense graph, CaseEdit uses a multi-stage inference process to generate both typical and atypical contextual edits for household objects, paired with targeted evaluation questions across four axes: reliability, generalization, locality, and portability. We evaluate established knowledge editing methods using CaseEdit and demonstrate that AlphaEdit, a technique employing null-space projection to minimize interference with unrelated knowledge, consistently outperforms other methods when applied to an LLaMA 3.2 3B model, even in scalability tests, showing minimal ripple effects. Our results indicate that using CaseEdit with effective editing techniques like AlphaEdit allows small models to internalize high-quality, context-sensitive common-sense knowledge, paving the way for lightweight, personalized assistants.
摘要
大语言模型(LLMs)在事实回忆和通用推理方面表现优异,但难以适应用户特定的常识知识,这一挑战在优先考虑计算效率的小参数量场景中尤为突出。为此,我们提出CaseEdit——一个用于评估小型LLMs中局部化、个性化常识知识编辑的新数据集与生成流程。该工作基于ATOMIC20/20常识图谱,通过多阶段推理过程生成家用物品的典型与非典型上下文编辑内容,并配套针对可靠性、泛化性、局部性和可迁移性四个维度的评估问题。我们使用CaseEdit评估现有知识编辑方法,结果表明:采用零空间投影技术以最小化无关知识干扰的AlphaEdit方法,在LLaMA 3.2 3B模型上持续优于其他方法,即使在可扩展性测试中也仅产生微小涟漪效应。研究证实,通过CaseEdit与AlphaEdit等高效编辑技术结合,可使小模型内化高质量、上下文敏感的常识知识,为轻量级个性化助手的发展铺平道路。
Recalibrating the Compass: Integrating Large Language Models into Classical Research Methods
Abstract
arXiv:2505.19402v1 Announce Type: new Abstract: This paper examines how large language models (LLMs) are transforming core quantitative methods in communication research in particular, and in the social sciences more broadly, namely content analysis, survey research, and experimental studies. Rather than replacing classical approaches, LLMs introduce new possibilities for coding and interpreting text, simulating dynamic respondents, and generating personalized and interactive stimuli. Drawing on recent interdisciplinary work, the paper highlights both the potential and limitations of LLMs as research tools, including issues of validity, bias, and interpretability. To situate these developments theoretically, the paper revisits Lasswell's foundational framework -- "Who says what, in which channel, to whom, with what effect?" -- and demonstrates how LLMs reconfigure message studies, audience analysis, and effects research by enabling interpretive variation, audience trajectory modeling, and counterfactual experimentation. Revisiting the metaphor of the methodological compass, the paper argues that classical research logics remain essential as the field integrates LLMs and generative AI. By treating LLMs not only as technical instruments but also as epistemic and cultural tools, the paper calls for thoughtful, rigorous, and imaginative use of LLMs in future communication and social science research.
摘要
本文探讨了大型语言模型(LLM)如何变革传播学研究乃至更广泛社会科学领域的核心定量方法,特别是内容分析、调查研究和实验研究。LLM并非取代传统方法,而是为文本编码与解释、动态受访者模拟以及个性化交互式刺激生成提供了新的可能性。基于近期跨学科研究成果,本文既强调了LLM作为研究工具的潜力,也指出了其在效度、偏差和可解释性等方面的局限。为从理论层面定位这些发展,本文重新审视了拉斯韦尔的基础框架——'谁通过什么渠道向谁说了什么并产生什么效果?',并论证LLM如何通过实现解释变异、受众轨迹建模和反事实实验,重构了信息研究、受众分析和效果研究。通过重温方法论罗盘的隐喻,本文指出在整合LLM和生成式AI的过程中,经典研究逻辑仍然不可或缺。通过将LLM不仅视为技术工具,更作为认知与文化工具,本文呼吁在未来传播学与社会科学研究中以深思熟虑、严谨且富有想象力的方式运用LLM。
Origin Tracer: A Method for Detecting LoRA Fine-Tuning Origins in LLMs
Abstract
arXiv:2505.19466v1 Announce Type: new Abstract: As large language models (LLMs) continue to advance, their deployment often involves fine-tuning to enhance performance on specific downstream tasks. However, this customization is sometimes accompanied by misleading claims about the origins, raising significant concerns about transparency and trust within the open-source community. Existing model verification techniques typically assess functional, representational, and weight similarities. However, these approaches often struggle against obfuscation techniques, such as permutations and scaling transformations. To address this limitation, we propose a novel detection method Origin-Tracer that rigorously determines whether a model has been fine-tuned from a specified base model. This method includes the ability to extract the LoRA rank utilized during the fine-tuning process, providing a more robust verification framework. This framework is the first to provide a formalized approach specifically aimed at pinpointing the sources of model fine-tuning. We empirically validated our method on thirty-one diverse open-source models under conditions that simulate real-world obfuscation scenarios. We empirically analyze the effectiveness of our framework and finally, discuss its limitations. The results demonstrate the effectiveness of our approach and indicate its potential to establish new benchmarks for model verification.
摘要
随着大语言模型(LLM)的持续发展,其部署通常涉及针对特定下游任务的微调以提升性能。然而这种定制化过程时常伴随关于模型来源的误导性声明,引发了开源社区对透明度和信任的严重关切。现有模型验证技术主要评估功能、表征和权重层面的相似性,但这些方法往往难以应对置换和尺度变换等混淆技术。为突破这一局限,我们提出了一种新型检测方法Origin-Tracer,该方法能严格判定模型是否基于指定基础模型进行过微调,包括提取微调过程中使用的LoRA秩,从而构建更鲁棒的验证框架。该框架首次提供了专门用于追溯模型微调来源的形式化方法。我们在模拟真实混淆场景的条件下,对三十一个多样化开源模型进行了实证验证,分析了框架的有效性并探讨了其局限性。实验结果证明了本方法的有效性,并显示出其有望为模型验证建立新基准的潜力。
Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions
Abstract
arXiv:2505.19501v1 Announce Type: new Abstract: In this short report, we present an automated pipeline tailored for the genomics domain and introduce Genome-Bench, a new benchmark constructed from over a decade of scientific forum discussions on genome engineering. Our pipeline transforms raw interactions into a reinforcement-learning-friendly multiple-choice question format, supported by 3000+ high-quality question-answer pairs spanning foundational biology, experimental troubleshooting, tool usage, and beyond. To our knowledge, this is the first end-to-end pipeline for teaching LLMs to reason from scientific discussions, with promising potential for generalization across scientific domains beyond biology.
摘要
在这份简短报告中,我们提出了一个专为基因组学领域设计的自动化流程,并介绍了Genome-Bench——一个基于十余年基因组工程科学论坛讨论构建的新型基准测试。该流程将原始互动数据转化为适合强化学习的多选题形式,包含3000余个涵盖基础生物学、实验故障排除、工具使用等方面的高质量问答对。据我们所知,这是首个教导大语言模型从科学讨论中推理的端到端流程,在生物学之外的其他科学领域也具有广阔的推广潜力。
Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models
Abstract
arXiv:2505.19474v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual understanding tasks, yet they often suffer from object hallucinations--generating descriptions of objects that are inconsistent with or entirely absent from the input. This issue is closely related to dataset biases, where frequent co-occurrences of objects lead to entangled semantic representations across modalities. As a result, models may erroneously activate object representations that are commonly associated with the input but not actually present. To address this, we propose a causality-driven disentanglement framework that mitigates hallucinations through causal intervention. Our approach includes a Causal-Driven Projector in the visual pathway and a Causal Intervention Module integrated into the final transformer layer of the language model. These components work together to reduce spurious correlations caused by biased training data. Experimental results show that our method significantly reduces hallucinations while maintaining strong performance on multiple multimodal benchmarks. Visualization analyses further confirm improved separability of object representations. The code is available at: https://github.com/IgniSavium/Causal-LLaVA
摘要
多模态大语言模型(MLLMs)在视觉理解任务中展现出强大性能,但普遍存在物体幻觉问题——生成与输入内容不符或完全不存在的物体描述。该问题与数据集偏差密切相关,即物体频繁共现导致跨模态语义表征纠缠,使得模型可能错误激活与输入常见关联但实际未出现的物体表征。
为此,我们提出一种因果驱动的解耦框架,通过因果干预缓解幻觉现象。该方法在视觉通路中引入因果驱动投影器,并在语言模型最终Transformer层集成因果干预模块,协同降低有偏训练数据导致的虚假相关性。
实验结果表明,本方法在保持多模态基准性能的同时显著减少幻觉现象。可视化分析进一步证实物体表征可分离性得到提升。 代码发布于:https://github.com/IgniSavium/Causal-LLaVA
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model
Abstract
arXiv:2505.19406v1 Announce Type: new Abstract: While large language models (LLMs) demonstrate strong reasoning capabilities utilizing reinforcement learning (RL) with verifiable reward, whether large vision-language models (VLMs) can directly inherit such capabilities through similar post-training strategies remains underexplored. In this work, we conduct a systematic compositional probing study to evaluate whether current VLMs trained with RL or other post-training strategies can compose capabilities across modalities or tasks under out-of-distribution conditions. We design a suite of diagnostic tasks that train models on unimodal tasks or isolated reasoning skills, and evaluate them on multimodal, compositional variants requiring skill integration. Through comparisons between supervised fine-tuning (SFT) and RL-trained models, we identify three key findings: (1) RL-trained models consistently outperform SFT on compositional generalization, demonstrating better integration of learned skills; (2) although VLMs achieve strong performance on individual tasks, they struggle to generalize compositionally under cross-modal and cross-task scenarios, revealing a significant gap in current training strategies; (3) requiring models to explicitly describe visual content before reasoning (e.g., caption-before-thinking), along with rewarding progressive vision-to-text grounding, yields notable gains. This highlights two essential ingredients for improving compositionality in VLMs: visual-to-text alignment and accurate visual grounding. Our findings shed light on the current limitations of RL-based reasoning VLM training and provide actionable insights toward building models that reason compositionally across modalities and tasks.
摘要
尽管大语言模型(LLMs)通过可验证奖励的强化学习(RL)展现出强大的推理能力,但大视觉语言模型(VLMs)能否通过类似的后训练策略直接继承这种能力仍待探索。本研究通过系统性组合探针实验,评估当前采用RL或其他后训练策略的VLMs在分布外条件下能否跨模态或跨任务组合能力。我们设计了一套诊断任务,使模型在单模态任务或孤立推理技能上训练,并在需要技能整合的多模态组合变体上测试。通过对比监督微调(SFT)与RL训练模型,发现三个关键结论:(1)RL训练模型在组合泛化上持续优于SFT,表现出更好的技能整合能力;(2)尽管VLMs在单项任务上表现优异,但在跨模态和跨任务的组合泛化中存在显著困难,揭示了当前训练策略的不足;(3)强制模型在推理前显式描述视觉内容(如'描述-再思考'策略)并奖励渐进式视觉-文本 grounding 能带来显著提升。这凸显了提升VLM组合性的两个关键要素:视觉-文本对齐与精确的视觉 grounding。我们的发现揭示了当前基于RL的VLM推理训练的局限性,并为构建跨模态和跨任务组合推理模型提供了可行方向。
Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents
Abstract
arXiv:2505.19436v1 Announce Type: new Abstract: Large Language Models (LLMs) falter in multi-step interactions -- often hallucinating, repeating actions, or misinterpreting user corrections -- due to reliance on linear, unstructured context. This fragility stems from the lack of persistent memory to track evolving goals and task dependencies, undermining trust in autonomous agents. We introduce the Task Memory Engine (TME), a modular memory controller that transforms existing LLMs into robust, revision-aware agents without fine-tuning. TME implements a spatial memory framework that replaces flat context with graph-based structures to support consistent, multi-turn reasoning. Departing from linear concatenation and ReAct-style prompting, TME builds a dynamic task graph -- either a tree or directed acyclic graph (DAG) -- to map user inputs to subtasks, align them with prior context, and enable dependency-tracked revisions. Its Task Representation and Intent Management (TRIM) component models task semantics and user intent to ensure accurate interpretation. Across four multi-turn scenarios -- trip planning, cooking, meeting scheduling, and shopping cart editing -- TME eliminates 100% of hallucinations and misinterpretations in three tasks, and reduces hallucinations by 66.7% and misinterpretations by 83.3% across 27 user turns, outperforming ReAct. TME's modular design supports plug-and-play deployment and domain-specific customization, adaptable to both personal assistants and enterprise automation. We release TME's codebase, benchmarks, and components as open-source resources, enabling researchers to develop reliable LLM agents. TME's scalable architecture addresses a critical gap in agent performance across complex, interactive settings.
摘要
大型语言模型(LLMs)在多步交互中存在明显缺陷——常出现幻觉、重复操作或误解用户修正——这源于其对线性非结构化上下文的依赖。这种脆弱性是由于缺乏持续记忆来追踪动态目标和任务依赖关系,从而削弱了自主代理的可信度。我们提出任务记忆引擎(TME),一种模块化记忆控制器,无需微调即可将现有LLMs转化为具备修订感知能力的鲁棒代理。TME采用空间记忆框架,用基于图的结构取代扁平化上下文,以支持连贯的多轮推理。不同于线性拼接和ReAct式提示,TME构建动态任务图(树状或有向无环图)来将用户输入映射至子任务,使其与先验上下文对齐,并实现依赖追踪的修订。其任务表征与意图管理(TRIM)组件通过建模任务语义和用户意图确保准确解析。在旅行规划、烹饪、会议安排和购物车编辑四个多轮场景中,TME在三个任务中完全消除幻觉和误解现象,并在27轮用户交互中整体减少66.7%的幻觉和83.3%的误判,性能超越ReAct。TME的模块化设计支持即插即用部署和领域定制,可适配个人助手与企业自动化场景。我们开源TME的代码库、基准测试及组件,助力研究者开发可靠LLM代理。该可扩展架构填补了复杂交互场景下代理性能的关键空白。
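As an illustration of what dependency-tracked revisions over a task graph might look like, the following is a minimal sketch and not TME's released code. The `TaskNode` and `TaskGraph` classes, their fields, and the trip-planning values are assumptions made for this example. Subtasks may only depend on subtasks that already exist, which keeps the graph acyclic, and revising one subtask marks everything downstream as stale.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    name: str
    value: str
    depends_on: list = field(default_factory=list)
    stale: bool = False


class TaskGraph:
    """Hypothetical task DAG; names and structure are illustrative only."""

    def __init__(self):
        self.nodes = {}

    def add(self, name, value, depends_on=()):
        # Rejecting duplicates and unknown dependencies keeps the graph acyclic.
        if name in self.nodes:
            raise ValueError(f"duplicate subtask: {name}")
        for dep in depends_on:
            if dep not in self.nodes:
                raise KeyError(f"unknown dependency: {dep}")
        self.nodes[name] = TaskNode(name, value, list(depends_on))

    def revise(self, name, new_value):
        """Update one subtask and mark every downstream subtask as stale."""
        self.nodes[name].value = new_value
        self.nodes[name].stale = False
        stack, seen, stale = [name], {name}, []
        while stack:
            current = stack.pop()
            for node in self.nodes.values():
                if current in node.depends_on and node.name not in seen:
                    node.stale = True
                    seen.add(node.name)
                    stale.append(node.name)
                    stack.append(node.name)
        return sorted(stale)


graph = TaskGraph()
graph.add("destination", "Paris")
graph.add("flight", "book flight to Paris", depends_on=["destination"])
graph.add("hotel", "book hotel in Paris", depends_on=["destination"])
graph.add("dinner", "reserve dinner near hotel", depends_on=["hotel"])
graph.add("budget", "2000 EUR")
stale = graph.revise("destination", "Rome")
```

In this toy run, correcting the destination flags the flight, hotel, and dinner subtasks for re-planning while leaving the unrelated budget untouched, which is the kind of behaviour a flat, linear context cannot express directly.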
Judging with Many Minds: Do More Perspectives Mean Less Prejudice?
Abstract
arXiv:2505.19477v1 Announce Type: new Abstract: LLM-as-Judge has emerged as a scalable alternative to human evaluation, enabling large language models (LLMs) to provide reward signals during training. While recent work has explored multi-agent extensions such as multi-agent debate and meta-judging to enhance evaluation quality, the question of how intrinsic biases manifest in these settings remains underexplored. In this study, we conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias. We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge. Our results show that the debate framework amplifies biases sharply after the initial debate, and this increased bias is sustained in subsequent rounds, while meta-judge approaches exhibit greater resistance. We further investigate the incorporation of PINE, a leading single-agent debiasing method, as a bias-free agent within these systems. The results reveal that this bias-free agent effectively reduces biases in debate settings but provides less benefit in meta-judge scenarios. Our work provides a comprehensive study of bias behavior in multi-agent LLM-as-Judge systems and highlights the need for targeted bias mitigation strategies in collaborative evaluation settings.
摘要
LLM-as-Judge(大语言模型作为评判者)已成为人类评估的可扩展替代方案,使大语言模型(LLMs)能够在训练中提供奖励信号。尽管近期研究探索了多智能体扩展(如多智能体辩论和元评判)以提升评估质量,但这些场景中内在偏见如何显现的问题仍未得到充分研究。在本研究中,我们对四种不同类型的偏见进行了系统分析:位置偏见、冗长偏见、思维链偏见和从众偏见。我们在两种广泛采用的多智能体LLM-as-Judge框架(多智能体辩论和LLM-as-元评判)中评估了这些偏见。结果表明,辩论框架在初始辩论后偏见急剧放大,且这种增加的偏见在后续轮次中持续存在,而元评判方法表现出更强的抵抗性。我们进一步研究了将领先的单智能体去偏方法PINE作为无偏见智能体引入这些系统的效果。结果显示,该无偏见智能体能有效减少辩论设置中的偏见,但在元评判场景中益处有限。本研究全面探讨了多智能体LLM-as-Judge系统中的偏见行为,并强调了在协作评估场景中需要针对性偏见缓解策略的重要性。
Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs
Abstract
arXiv:2505.19489v1 Announce Type: new Abstract: The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, a FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL, an enhancement framework designed to improve FL effectiveness of LLM agents for the Linux kernel. LinuxFL substantially improves the FL accuracy of all studied agents (e.g., 7.2% - 11.2% accuracy increase) with minimal costs. Data and code are available at https://github.com/FudanSELab/LinuxFLBench.
摘要
Linux内核作为支撑众多系统的关键基础设施,其缺陷可能导致影响数十亿用户的严重后果。故障定位(FL)技术通过识别软件中的缺陷代码元素,在质量保障中发挥着核心作用。尽管当前大语言模型智能体在SWE-bench等基准测试中展现出良好的FL准确率,但其在Linux内核中的表现尚不明确——由于代码规模庞大、可观测性受限及影响因素复杂,内核FL任务更具挑战性。本文提出LinuxFLBench,一个基于真实内核缺陷构建的FL基准测试集,并通过实证研究评估前沿大语言模型智能体在内核环境中的表现。实验结果表明,现有智能体在此任务中表现欠佳,文件级定位的最高top-1准确率仅为41.6%。为此,我们设计LinuxFL增强框架以提升LLM智能体在内核FL中的效能。该框架以极小成本显著提高了所有测试智能体的定位准确率(如7.2%-11.2%的提升幅度)。相关数据与代码已开源:https://github.com/FudanSELab/LinuxFLBench。
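For readers unfamiliar with the file-level top-1 metric quoted above, here is a minimal sketch of one common way such a score is computed. It assumes a bug counts as localized when any of the top-k ranked files is among its known buggy files; the benchmark's exact definition may differ. The function name and the tiny dataset are hypothetical and purely illustrative, not taken from LinuxFLBench.

```python
def top_k_accuracy(predictions, ground_truth, k=1):
    """Fraction of bugs with at least one truly buggy file in the top-k ranking.

    predictions  -- dict: bug id -> list of file paths, best guess first
    ground_truth -- dict: bug id -> set of files actually changed by the fix
    """
    hits = 0
    for bug_id, ranked_files in predictions.items():
        if set(ranked_files[:k]) & ground_truth[bug_id]:
            hits += 1
    return hits / len(predictions)


# Toy data, invented for illustration only.
predictions = {
    "bug-1": ["drivers/net/a.c", "net/core/b.c"],
    "bug-2": ["fs/ext4/c.c", "fs/ext4/d.c"],
    "bug-3": ["mm/e.c", "kernel/f.c"],
}
ground_truth = {
    "bug-1": {"drivers/net/a.c"},
    "bug-2": {"fs/ext4/d.c"},
    "bug-3": {"arch/x86/g.c"},
}
top1 = top_k_accuracy(predictions, ground_truth, k=1)
top2 = top_k_accuracy(predictions, ground_truth, k=2)
```

On this toy data only the first bug is hit at rank 1, and the second is recovered once the top two files are considered.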
VLMLight: Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning
Abstract
arXiv:2505.19486v1 Announce Type: new Abstract: Traffic signal control (TSC) is a core challenge in urban mobility, where real-time decisions must balance efficiency and safety. Existing methods - ranging from rule-based heuristics to reinforcement learning (RL) - often struggle to generalize to complex, dynamic, and safety-critical scenarios. We introduce VLMLight, a novel TSC framework that integrates vision-language meta-control with dual-branch reasoning. At the core of VLMLight is the first image-based traffic simulator that enables multi-view visual perception at intersections, allowing policies to reason over rich cues such as vehicle type, motion, and spatial density. A large language model (LLM) serves as a safety-prioritized meta-controller, selecting between a fast RL policy for routine traffic and a structured reasoning branch for critical cases. In the latter, multiple LLM agents collaborate to assess traffic phases, prioritize emergency vehicles, and verify rule compliance. Experiments show that VLMLight reduces waiting times for emergency vehicles by up to 65% over RL-only systems, while preserving real-time performance in standard conditions with less than 1% degradation. VLMLight offers a scalable, interpretable, and safety-aware solution for next-generation traffic signal control.
摘要
交通信号控制(TSC)是城市交通中的核心挑战,其实时决策需兼顾效率与安全性。现有方法——从基于规则的启发式到强化学习(RL)——往往难以泛化至复杂、动态且安全至上的场景。本文提出VLMLight,一种融合视觉语言元控制与双分支推理的新型TSC框架。其核心是首个基于图像的交通模拟器,可实现交叉路口的全景视觉感知,使策略能解析车辆类型、运动状态及空间密度等丰富信息。大型语言模型(LLM)作为安全优先的元控制器,在常规交通的快速RL策略与关键场景的结构化推理分支间动态切换。后者通过多LLM智能体协作,评估交通相位、优先调度应急车辆并验证规则合规性。实验表明,相较于纯RL系统,VLMLight将应急车辆等待时间缩短达65%,同时在标准条件下保持实时性能(延迟率低于1%)。该框架为下一代交通信号控制提供了可扩展、可解释且安全感知的解决方案。
Automated CAD Modeling Sequence Generation from Text Descriptions via Transformer-Based Large Language Models
Abstract
arXiv:2505.19490v1 Announce Type: new Abstract: Designing complex computer-aided design (CAD) models is often time-consuming due to challenges such as computational inefficiency and the difficulty of generating precise models. We propose a novel language-guided framework for industrial design automation to address these issues, integrating large language models (LLMs) with computer-automated design (CAutoD). Through this framework, CAD models are automatically generated from parameters and appearance descriptions, supporting the automation of design tasks during the detailed CAD design phase. Our approach introduces three key innovations: (1) a semi-automated data annotation pipeline that leverages LLMs and vision-language large models (VLLMs) to generate high-quality parameters and appearance descriptions; (2) a Transformer-based CAD generator (TCADGen) that predicts modeling sequences via dual-channel feature aggregation; (3) an enhanced CAD modeling generation model, called CADLLM, that is designed to refine the generated sequences by incorporating the confidence scores from TCADGen. Experimental results demonstrate that the proposed approach outperforms traditional methods in both accuracy and efficiency, providing a powerful tool for automating industrial workflows and generating complex CAD models from textual prompts. The code is available at https://jianxliao.github.io/cadllm-page/
摘要
设计复杂的计算机辅助设计(CAD)模型常因计算效率低下和生成精确模型的困难而耗时。为解决这些问题,我们提出了一种新颖的语言引导工业设计自动化框架,将大语言模型(LLMs)与计算机自动化设计(CAutoD)相结合。该框架通过参数和外观描述自动生成CAD模型,支持详细CAD设计阶段的任务自动化。我们的方法包含三项关键创新:(1)利用LLMs和视觉语言大模型(VLLMs)生成高质量参数与外观描述的半自动数据标注流程;(2)基于Transformer的双通道特征聚合CAD生成器(TCADGen),用于预测建模序列;(3)改进的CAD建模生成模型CADLLM,通过整合TCADGen的置信度分数优化生成序列。实验结果表明,所提方法在精度和效率上均优于传统方法,为工业流程自动化及文本提示生成复杂CAD模型提供了强大工具。代码详见https://jianxliao.github.io/cadllm-page/
BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
Abstract
arXiv:2505.19457v1 Announce Type: new Abstract: Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
摘要
大语言模型在通用任务中表现卓越,但评估其在金融、法律和医疗等逻辑密集、精度至关键领域的可靠性仍具挑战性。为此,我们推出BizFinBench——首个专为评估大语言模型在真实金融场景中应用性能而设计的基准测试。该基准包含6,781条经过精细标注的中文查询,涵盖数值计算、逻辑推理、信息抽取、预测识别和知识问答五个维度,细分为九个子类别,同时采用客观与主观双重评估指标。我们还提出IteraJudge这一创新的大语言模型评估方法,可有效降低模型作为评估者时的客观指标偏差。我们对25个专有及开源模型进行了全面测试,实验表明没有任何模型能在所有任务中占据优势。评估结果揭示了显著的能力分化:(1)数值计算任务中Claude-3.5-Sonnet(63.18)与DeepSeek-R1(64.04)领先,而Qwen2.5-VL-3B(15.92)等小模型表现欠佳;(2)逻辑推理领域由专有模型主导(ChatGPT-o3:83.58,Gemini-2.0-Flash:81.15),开源模型最大落后19.49分;(3)信息抽取任务性能差异最为显著,DeepSeek-R1达71.46分,Qwen3-1.7B仅11.23分;(4)预测识别任务中各模型表现趋同,最优模型得分介于39.16至50.00之间。研究发现,当前大语言模型虽能胜任常规金融查询,但在需要跨概念推理的复杂场景中仍存在局限。BizFinBench为未来研究提供了严格贴合商业实践的评估基准,代码与数据集已开源于https://github.com/HiThink-Research/BizFinBench。
Customising Electricity Contracts at Scale with Large Language Models
Abstract
arXiv:2505.19551v1 Announce Type: new Abstract: The electricity system is becoming more complex, connecting massive numbers of end-users and distributed generators. Adding or removing grid connections requires expert studies to align technical constraints with user requests. In times of labour shortages, carrying out these studies takes up a significant share of the time that engineers at system operators spend in planning departments. As time is limited, only standard block connectivity contracts can be offered to end-users, or the requests pile up. Even if offers are made, these often do not perfectly match the user's requirements, leading to overpaying or underusing the grid capacity. This paper investigates whether end-users can negotiate individual, flexible time-of-use contracts directly with the grid using Large Language Models (LLMs) in chats at scale. The LLM-based chat has direct access to a model of the grid and studies the grid's technical constraints just as an expert engineer would. The advantage of this system is that end-users can directly interact with grid models through natural language; no intermediary is needed to service, analyse, study, assess, advise, consult and engineer. This initial study paves the way toward developing this tailored LLM system, resulting in possible high-efficiency gains for grid planning and customer management.
摘要
电力系统正变得日益复杂,需要连接海量终端用户和分布式发电设备。新增或移除电网连接需通过专家研究来协调技术约束与用户需求。在劳动力短缺时期,开展这些研究占据了系统运营商工程师在规划部门的大量工作时间。由于时间有限,运营商只能向终端用户提供标准区块连接合约,或导致需求积压。即便提供合约方案,也常无法完全匹配用户需求,造成电网容量过度付费或利用不足。本文研究终端用户能否通过大规模聊天交互,利用大型语言模型(LLM)直接与电网协商个性化的灵活分时用电合约。基于LLM的聊天系统可直接访问电网模型,并像专业工程师一样研究电网技术约束。该系统的优势在于终端用户可通过自然语言直接与电网模型交互,无需中间环节进行服务、分析、研究、评估、建议、咨询和工程设计。这项初步研究为开发定制化LLM系统奠定基础,有望为电网规划和客户管理带来显著效率提升。
Turing Test 2.0: The General Intelligence Threshold
Abstract
arXiv:2505.19550v1 Announce Type: new Abstract: With the rise of artificial intelligence (A.I.) and large language models like Chat-GPT, a new race for achieving artificial general intelligence (A.G.I.) has started. While many speculate how and when A.I. will achieve A.G.I., there is no clear agreement on how A.G.I. can be detected in A.I. models, even when popular tools like the Turing test (and its modern variations) are used to measure their intelligence. In this work, we discuss why traditional methods like the Turing test do not suffice for measuring or detecting A.G.I. and provide a new, practical method that can be used to decide if a (computer or any other) system has reached or surpassed A.G.I. To achieve this, we make two new contributions. First, we present a clear definition for general intelligence (G.I.) and set a G.I. threshold (G.I.T.) that can be used to distinguish between systems that achieve A.G.I. and systems that do not. Second, we present a new framework on how to construct tests that can detect if a system has achieved G.I. in a simple, comprehensive, and clear-cut fail/pass way. We call this novel framework the Turing Tests 2.0. We then demonstrate real-life examples of applying tests that follow our Turing Tests 2.0 framework on modern A.I. models.
摘要
随着人工智能(A.I.)和Chat-GPT等大型语言模型的兴起,一场关于实现人工通用智能(A.G.I.)的新竞赛已然展开。尽管众多研究者推测A.I.实现A.G.I.的方式与时间节点,但关于如何在A.I.模型中检测A.G.I.仍缺乏明确共识——即便使用图灵测试(及其现代变体)等流行工具来评估其智能水平。本研究论述了为何图灵测试等传统方法不足以衡量或检测A.G.I.,并提出了一种可实际用于判定(计算机或其他)系统是否达到或超越A.G.I.的新方法。为此,我们作出两项新贡献:首先提出通用智能(G.I.)的明确定义,并设立可用于区分是否达成A.G.I.的通用智能阈值(G.I.T.);其次构建新型测试框架,以简单、全面且非黑即白的通过/失败方式检测系统是否实现G.I.。我们将这一创新框架命名为"图灵测试2.0",并通过在现代A.I.模型上应用符合该框架测试的实际案例进行实证展示。
Automated Text-to-Table for Reasoning-Intensive Table QA: Pipeline Design and Benchmarking Insights
Abstract
arXiv:2505.19563v1 Announce Type: new Abstract: Reasoning with tabular data holds increasing importance in modern applications, yet comprehensive evaluation methodologies for reasoning-intensive Table Question Answering (QA) tasks remain nascent. Existing research is constrained by two primary bottlenecks: 1) reliance on costly, manually annotated real-world data, which struggles to cover complex reasoning scenarios; 2) the heterogeneity of table structures, which hinders systematic analysis of the intrinsic mechanisms behind the underperformance of LLMs, especially in reasoning-intensive tasks. To address these issues, we propose an automated generation pipeline, AutoT2T, that transforms mathematical word problems into table-based reasoning tasks, eliminating the need for manual annotation. The pipeline can generate multiple variants of a table for the same reasoning problem, including noisy versions to support robustness evaluation. Based on this, we construct a new benchmark, TabularGSM, which systematically spans a range of table complexities and trap problems. Experimental analyses through AutoT2T and TabularGSM reveal that the tight coupling between reasoning and retrieval or identification processes is a key factor underlying the failure of LLMs in complex Table QA tasks. This highlights the necessity for models to develop synergistic reasoning capabilities in order to perform effectively in complex Table QA tasks.
摘要
在现代应用中,基于表格数据的推理日益重要,然而针对推理密集型表格问答(QA)任务的综合评估方法仍处于起步阶段。现有研究主要受限于两个瓶颈:1)依赖成本高昂的人工标注真实数据,难以覆盖复杂推理场景;2)表格结构的异质性阻碍了对大语言模型(LLMs)在推理密集型任务中表现不佳的内在机制进行系统分析。为解决这些问题,我们提出自动化生成流程AutoT2T,将数学文字问题转化为基于表格的推理任务,无需人工标注。该流程能针对同一推理问题生成包括支持鲁棒性评估的噪声版本在内的多种表格变体。基于此,我们构建了新基准TabularGSM,系统覆盖不同复杂度表格及陷阱问题。通过AutoT2T和TabularGSM的实验分析表明,推理与检索或识别过程的紧密耦合是LLMs在复杂表格QA任务中失败的关键因素,这凸显了模型需发展协同推理能力以有效应对复杂表格QA任务的必要性。
AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare
Abstract
arXiv:2505.19562v1 Announce Type: new Abstract: Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their mistakes and the biases behind them pose life-critical risks. Bias linked to race, sex, and socioeconomic status is already well known, but a consistent and automatic testbed for measuring it is missing. To fill this gap, this paper presents AMQA -- an Adversarial Medical Question-Answering dataset -- built for automated, large-scale bias evaluation of LLMs in medical QA. AMQA includes 4,806 medical QA pairs sourced from the United States Medical Licensing Examination (USMLE) dataset, generated using a multi-agent framework to create diverse adversarial descriptions and question pairs. Using AMQA, we benchmark five representative LLMs and find surprisingly substantial disparities: even GPT-4.1, the least biased model tested, answers privileged-group questions over 10 percentage points more accurately than unprivileged ones. Compared with the existing benchmark CPV, AMQA reveals 15% larger accuracy gaps on average between privileged and unprivileged groups. Our dataset and code are publicly available at https://github.com/XY-Showing/AMQA to support reproducible research and advance trustworthy, bias-aware medical AI.
摘要
大型语言模型(LLMs)在医学诊断问题上已达到专家级准确度,但其错误及背后的偏见仍存在危及生命的风险。与种族、性别和社会经济地位相关的偏见已广为人知,但尚缺乏一致且自动化的测试平台来衡量这些偏见。为填补这一空白,本文提出AMQA——一个对抗性医学问答数据集——专为医学QA中LLMs的自动化、大规模偏见评估而构建。AMQA包含4,806个医学问答对,源自美国医师执照考试(USMLE)数据集,通过多智能体框架生成多样化的对抗性描述和问题对。利用AMQA,我们对五种代表性LLMs进行基准测试,发现存在惊人的显著差异:即使是被测试模型中偏见最少的GPT-4.1,其对特权群体问题的回答准确率仍比非特权群体高出10个百分点以上。与现有基准CPV相比,AMQA揭示的特权与非特权群体间准确率差距平均扩大15%。我们的数据集和代码已公开于https://github.com/XY-Showing/AMQA,以支持可重复研究并推动可信赖、具有偏见意识的医疗AI发展。
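The gap quoted in the abstract above is stated in percentage points, which is an absolute difference between two accuracies rather than a relative change. A minimal sketch of that arithmetic follows; the function names and the counts are hypothetical and chosen only for illustration, they are not AMQA results or code from the AMQA repository.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def gap_in_percentage_points(acc_a: float, acc_b: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return (acc_a - acc_b) * 100.0


# Hypothetical counts for illustration only; these are NOT AMQA results.
privileged_acc = accuracy(correct=85, total=100)
unprivileged_acc = accuracy(correct=73, total=100)
gap = gap_in_percentage_points(privileged_acc, unprivileged_acc)
```

With these made-up counts the gap is 12 percentage points, while the relative difference (12 / 73) would be a larger number, which is why the two units should not be mixed when comparing benchmarks.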
Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models
Abstract
arXiv:2505.19621v1 Announce Type: new Abstract: As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it's crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs' subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated the effect of increasing the test-time compute, through reasoning and self-reflection mechanisms, on those metrics. While effective in other tasks, our results show that these mechanisms offer only limited gains in our domain. Furthermore, we reveal that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend. POBS: https://ibm.github.io/POBS
摘要
随着大型语言模型(LLMs)深度融入人类生活并日益影响决策过程,评估其是否及在何种程度上表现出主观偏好、观点和信念变得至关重要。这些倾向可能源于模型内部的偏见,进而塑造其行为、影响向用户提供的建议与推荐,并可能强化特定观点。本文提出'偏好、观点与信念调查'(POBs)基准,该基准用于评估LLMs在社会、文化、伦理及个人领域的倾向性。我们运用该基准评估了领先的开源与闭源LLMs,测量了可靠性、中立性和一致性等关键属性。此外,我们通过推理与自省机制研究了增加测试时计算资源对这些指标的影响。结果显示,尽管这些机制在其他任务中有效,但在本领域仅能带来有限提升。进一步研究发现,新版模型正变得愈发不一致,且更倾向于特定观点,这揭示了当前研究的盲点及令人担忧的发展趋势。POBS项目地址:https://ibm.github.io/POBS
LLM-Agent-Controller: A Universal Multi-Agent Large Language Model System as a Control Engineer
Abstract
arXiv:2505.19567v1 Announce Type: new Abstract: This study presents the LLM-Agent-Controller, a multi-agent large language model (LLM) system developed to address a wide range of problems in control engineering (Control Theory). The system integrates a central controller agent with multiple specialized auxiliary agents, responsible for tasks such as controller design, model representation, control analysis, time-domain response, and simulation. A supervisor oversees high-level decision-making and workflow coordination, enhancing the system's reliability and efficiency. The LLM-Agent-Controller incorporates advanced capabilities, including Retrieval-Augmented Generation (RAG), Chain-of-Thought reasoning, self-criticism and correction, efficient memory handling, and user-friendly natural language communication. It is designed to function without requiring users to have prior knowledge of Control Theory, enabling them to input problems in plain language and receive complete, real-time solutions. To evaluate the system, we propose new performance metrics assessing both individual agents and the system as a whole. We test five categories of Control Theory problems and benchmark performance across three advanced LLMs. Additionally, we conduct a comprehensive qualitative conversational analysis covering all key services. Results show that the LLM-Agent-Controller successfully solved 83% of general tasks, with individual agents achieving an average success rate of 87%. Performance improved with more advanced LLMs. This research demonstrates the potential of multi-agent LLM architectures to solve complex, domain-specific problems. By integrating specialized agents, supervisory control, and advanced reasoning, the LLM-Agent-Controller offers a scalable, robust, and accessible solution framework that can be extended to various technical domains.
摘要
本研究提出LLM-Agent-Controller——一个为解决控制工程(控制理论)领域广泛问题而开发的多智能体大语言模型系统。该系统将中央控制器智能体与多个专业辅助智能体相集成,分别负责控制器设计、模型表示、控制分析、时域响应及仿真等任务。监督器负责高层决策与工作流协调,从而提升系统的可靠性和效率。该架构融合了检索增强生成、思维链推理、自我批判与修正、高效记忆处理及用户友好的自然语言交互等先进功能,其设计使得用户无需具备控制理论背景知识,仅需用自然语言输入问题即可获得完整的实时解决方案。为评估系统性能,我们提出了同时评估单个智能体与整体系统的新指标,测试了五类控制理论问题并在三种先进大语言模型上进行基准比较,还对所有核心服务进行了全面的定性对话分析。结果表明,该系统成功解决了83%的常规任务,各智能体平均成功率达87%,且性能随大语言模型升级而提升。本研究证明了多智能体大语言模型架构在解决复杂领域特定问题方面的潜力,通过整合专业智能体、监督控制与高级推理能力,该框架提供了可扩展、鲁棒且易用的解决方案,可推广至多种技术领域。
Token-Importance Guided Direct Preference Optimization
Abstract
arXiv:2505.19653v1 Announce Type: new Abstract: Ensuring that large language models (LLMs) generate outputs aligned with human preferences is important for safe and effective AI interactions. Although Direct Preference Optimization (DPO) employs an implicit reward function to optimize the policy model, it and its related variants overlook the differential importance of individual tokens and are sensitive to judgment noise in preference datasets during generation. Although recent methods attempt to assess the importance weights of tokens via probability prediction or simplistic weighting schemes, these evaluation methods are prone to biases and still cannot fully address these issues. To solve this problem, we propose Token-Importance Guided Direct Preference Optimization (TI-DPO), which introduces two key innovations: gradient-based token-importance weights that dynamically prioritize critical tokens, and a triple loss that explicitly guides model outputs to approach human-preferred responses and stay away from non-preferred responses. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.
摘要
确保大型语言模型(LLM)生成的输出符合人类偏好,对于实现安全有效的人工智能交互至关重要。虽然直接偏好优化(DPO)采用隐式奖励函数来优化策略模型,但其及相关变体方法忽视了单个令牌的差异性重要性,且在生成过程中对偏好数据集中的判断噪声较为敏感。尽管近期研究尝试通过概率预测或简单加权方案评估令牌的重要性权重,但这些评估方法容易产生偏差,仍无法完全解决上述问题。为此,我们提出令牌重要性引导的直接偏好优化(TI-DPO),其包含两项关键创新:基于梯度的动态令牌重要性权重机制——优先处理关键令牌,以及三重损失函数——显式引导模型输出接近人类偏好响应并远离非偏好响应。实验结果表明,与DPO及其他强化学习人类反馈(RLHF)方法相比,TI-DPO具有更高的准确性和更强的生成多样性,能提供更稳定且计算效率更优的解决方案。
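For readers unfamiliar with the baseline, the sketch below computes the standard DPO loss for a single preference pair from per-token log-probabilities, with an optional per-token weight vector. The weighting hook is an assumption made for illustration: it is not the gradient-based importance weighting of TI-DPO, and the additional loss term mentioned in the abstract is not modelled. All function and parameter names are hypothetical.

```python
import math
from typing import Optional, Sequence


def weighted_log_ratio(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum over tokens of w_t * (log pi_theta(y_t) - log pi_ref(y_t)).

    With weights=None every token counts equally, which recovers the
    sequence-level log-ratio used by standard DPO.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must cover the same tokens")
    if weights is None:
        weights = [1.0] * len(policy_logps)
    if len(weights) != len(policy_logps):
        raise ValueError("one weight per token is required")
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))


def dpo_pair_loss(
    chosen_policy: Sequence[float],
    chosen_ref: Sequence[float],
    rejected_policy: Sequence[float],
    rejected_ref: Sequence[float],
    beta: float = 0.1,
    chosen_weights: Optional[Sequence[float]] = None,
    rejected_weights: Optional[Sequence[float]] = None,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        weighted_log_ratio(chosen_policy, chosen_ref, chosen_weights)
        - weighted_log_ratio(rejected_policy, rejected_ref, rejected_weights)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; it falls as the policy raises the chosen response relative to the rejected one. Token weights change how much each position contributes to that margin, which is the general idea token-level DPO variants build on.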
MSD-LLM: Predicting Ship Detention in Port State Control Inspections with Large Language Model
Abstract
arXiv:2505.19568v1 Announce Type: new Abstract: Maritime transportation is the backbone of global trade, making ship inspection essential for ensuring maritime safety and environmental protection. Port State Control (PSC), conducted by national ports, enforces compliance with safety regulations, with ship detention being the most severe consequence, impacting both ship schedules and company reputations. Traditional machine learning methods for ship detention prediction are limited by the capacity of representation learning and thus suffer from low accuracy. Meanwhile, autoencoder-based deep learning approaches face challenges due to the severe data imbalance in historical PSC detention records. To address these limitations, we propose Maritime Ship Detention with Large Language Models (MSD-LLM), integrating a dual robust subspace recovery (DSR) layer-based autoencoder with a progressive learning pipeline to handle imbalanced data and extract meaningful PSC representations. Then, a large language model groups and ranks features to identify likely detention cases, enabling dynamic thresholding for flexible detention predictions. Extensive evaluations on 31,707 PSC inspection records from the Asia-Pacific region show that MSD-LLM outperforms state-of-the-art methods by more than 12% on Area Under the Curve (AUC) for Singapore ports. Additionally, it demonstrates robustness to real-world challenges, making it adaptable to diverse maritime risk assessment scenarios.
摘要
海事运输是全球贸易的支柱,船舶检查对保障海上安全和环境保护至关重要。港口国监督(PSC)作为各国港口实施的监管机制,通过强制遵守安全法规来确保航行安全,其中船舶滞留是最严厉的处罚措施,会对船舶调度和公司声誉造成重大影响。传统机器学习方法因表征学习能力有限,导致船舶滞留预测准确率较低;而基于自动编码器的深度学习方法则因港口国监督滞留记录存在严重数据不平衡问题面临挑战。为突破这些局限,我们提出基于大语言模型的海事船舶滞留预测框架(MSD-LLM),通过集成双鲁棒子空间恢复层的自动编码器与渐进式学习流程,有效处理不平衡数据并提取有意义的港口国监督特征表示。随后利用大语言模型对特征进行分组排序以识别潜在滞留案例,并通过动态阈值实现灵活的滞留预测。基于亚太地区31,707条港口国监督检查记录的实验表明,该框架在新加坡港口的曲线下面积(AUC)指标上以超过12%的优势优于现有最优方法,同时展现出对实际应用挑战的强鲁棒性,可适应多样化海事风险评估场景。
Large Language Models' Reasoning Stalls: An Investigation into the Capabilities of Frontier Models
Abstract
arXiv:2505.19676v1 Announce Type: new Abstract: Empirical methods to examine the capability of Large Language Models (LLMs) to use Automated Theorem Prover (ATP) reasoning strategies are studied. We evaluate the performance of state-of-the-art models from December 2023 and August 2024 on PRONTOQA steamroller reasoning problems. For that, we develop methods for assessing LLM response accuracy and correct answer correlation. Our results show that progress in improving LLM reasoning abilities has stalled over the nine-month period. By tracking completion tokens, we show that almost all improvement in reasoning ability since GPT-4 was released can be attributed to either hidden system prompts or the training of models to automatically use generic Chain of Thought prompting strategies. Among the ATP reasoning strategies tried, we found that current frontier LLMs are best able to follow the bottom-up (also known as forward-chaining) strategy. A low positive correlation was found between an LLM response containing correct reasoning and arriving at the correct conclusion.
摘要
本文研究了评估大语言模型(LLM)运用自动定理证明器(ATP)推理策略能力的实证方法。我们评估了2023年12月至2024年8月期间最先进模型在PRONTOQA steamroller推理问题上的表现。为此,我们开发了评估LLM响应准确性和正确答案相关性的方法。研究结果表明,在九个月期间,LLM推理能力的提升进展陷入停滞。通过追踪完成标记,我们发现自GPT-4发布以来,几乎所有推理能力的提升都可归因于隐藏系统提示或训练模型自动使用通用思维链提示策略。在尝试的ATP推理策略中,当前前沿LLM最擅长遵循自底向上(又称前向链接)策略。研究发现,LLM响应包含正确推理与得出正确结论之间存在较低的正相关性。
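The bottom-up (forward-chaining) strategy named in the abstract above can be sketched in a few lines: start from the known facts and repeatedly fire any rule whose premises are all satisfied until nothing new can be derived. This is a generic propositional illustration, not the paper's evaluation code; the rule format and the toy facts are assumptions and are not taken from PRONTOQA.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A rule is (premises, conclusion): if every premise is known, add the conclusion.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Iterable[str], rules: Iterable[Rule]) -> Set[str]:
    """Return every fact derivable from `facts` by repeatedly applying `rules`."""
    known = set(facts)
    rule_list = list(rules)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rule_list:
            # Fire the rule only if it adds something new.
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical toy knowledge base for illustration only.
toy_rules = [
    (frozenset({"wolf"}), "animal"),
    (frozenset({"animal", "eats_meat"}), "carnivore"),
    (frozenset({"carnivore", "large"}), "apex_predator"),
]
derived = forward_chain({"wolf", "eats_meat"}, toy_rules)
```

Here `derived` contains `animal` and `carnivore` but not `apex_predator`, because the premise `large` is never established. A top-down (backward-chaining) strategy would instead start from the goal and search for rules that conclude it.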
FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks
Abstract
arXiv:2505.19662v1 Announce Type: new Abstract: This paper proposes FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, such agents are required to monitor and report safety and health incidents, as well as manufacturing-related incidents, that may occur in real-world work environments. Existing agentic AI benchmarks have been limited to evaluating web tasks and are insufficient for evaluating agents in real-world work environments, where complexity increases significantly. In this paper, we define a new action space that agentic AI should possess for real-world work environment benchmarks and improve the evaluation function from previous methods to assess the performance of agentic AI in diverse real-world tasks. The dataset consists of videos captured on-site and documents actually used in factories and warehouses, and tasks were created based on interviews with on-site workers and managers. Evaluation results confirmed that performance evaluation considering the characteristics of Multimodal LLMs (MLLMs) such as GPT-4o is feasible. Additionally, the effectiveness and limitations of the proposed new evaluation method were identified. The complete dataset (HuggingFace) and evaluation program (GitHub) can be downloaded from the following website: https://en-documents.research.global.fujitsu.com/fieldworkarena/.
摘要
本文提出FieldWorkArena基准,旨在针对现实世界现场工作的代理人工智能进行评估。随着近期对代理AI需求的增长,这些系统需要监测并报告现实工作环境中可能发生的安全健康事件及制造相关事故。现有代理AI基准仅限于评估网络任务,无法充分评估在复杂度显著提升的现实工作环境中的代理性能。本研究定义了代理AI在现实工作环境基准中应具备的新动作空间,并改进了先前方法的评估函数,以评估代理AI在多样化现实任务中的表现。数据集由现场拍摄视频及工厂仓库实际使用文档构成,任务设计基于对现场工人和管理者的访谈。评估结果证实,考虑GPT-4o等多模态大语言模型(MLLM)特性的性能评估具有可行性。同时明确了所提新评估方法的有效性和局限性。完整数据集(HuggingFace)和评估程序(GitHub)可从以下网站下载:https://en-documents.research.global.fujitsu.com/fieldworkarena/。
Large Language Models for Planning: A Comprehensive and Systematic Survey
Abstract
arXiv:2505.19683v1 Announce Type: new Abstract: Planning represents a fundamental capability of intelligent agents, requiring comprehensive environmental understanding, rigorous logical reasoning, and effective sequential decision-making. While Large Language Models (LLMs) have demonstrated remarkable performance on certain planning tasks, their broader application in this domain warrants systematic investigation. This paper presents a comprehensive review of LLM-based planning. Specifically, this survey is structured as follows: First, we establish the theoretical foundations by introducing essential definitions and categories about automated planning. Next, we provide a detailed taxonomy and analysis of contemporary LLM-based planning methodologies, categorizing them into three principal approaches: 1) External Module Augmented Methods that combine LLMs with additional components for planning, 2) Finetuning-based Methods that involve using trajectory data and feedback signals to adjust LLMs in order to improve their planning abilities, and 3) Searching-based Methods that break down complex tasks into simpler components, navigate the planning space, or enhance decoding strategies to find the best solutions. Subsequently, we systematically summarize existing evaluation frameworks, including benchmark datasets, evaluation metrics and performance comparisons between representative planning methods. Finally, we discuss the underlying mechanisms enabling LLM-based planning and outline promising research directions for this rapidly evolving field. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this field.
摘要
规划是智能体的核心能力,需要综合的环境理解、严谨的逻辑推理和有效的序列决策。尽管大语言模型(LLMs)在某些规划任务中表现出卓越性能,但其在该领域的广泛应用仍需系统研究。本文对基于LLM的规划方法进行了全面综述:首先通过介绍自动化规划的基本定义与分类建立理论基础;其次详细梳理了当前基于LLM的规划方法学,将其归纳为三大类——1)外部模块增强法:通过结合附加组件与LLMs协同规划,2)微调法:利用轨迹数据与反馈信号调整LLMs以提升规划能力,3)搜索法:将复杂任务分解为简单组件、遍历规划空间或优化解码策略以寻求最优解;随后系统总结了现有评估框架,包括基准数据集、评价指标及代表性规划方法的性能对比;最后探讨了LLM实现规划的内在机制,并展望了这一快速发展领域的潜在研究方向。本综述旨在为该领域的创新研究提供有价值的参考,推动相关技术进步。
ReChisel: Effective Automatic Chisel Code Generation by LLM with Reflection
Abstract
arXiv:2505.19734v1 Announce Type: new Abstract: Coding with hardware description languages (HDLs) such as Verilog is a time-intensive and laborious task. With the rapid advancement of large language models (LLMs), there is increasing interest in applying LLMs to assist with HDL coding. Recent efforts have demonstrated the potential of LLMs in translating natural language to traditional HDL Verilog. Chisel, a next-generation HDL based on Scala, introduces higher-level abstractions, facilitating more concise, maintainable, and scalable hardware designs. However, the potential of using LLMs for Chisel code generation remains largely unexplored. This work proposes ReChisel, an LLM-based agentic system designed to enhance the effectiveness of Chisel code generation. ReChisel incorporates a reflection mechanism to iteratively refine the quality of generated code using feedback from compilation and simulation processes, and introduces an escape mechanism to break free from non-progress loops. Experiments demonstrate that ReChisel significantly improves the success rate of Chisel code generation, achieving performance comparable to state-of-the-art LLM-based agentic systems for Verilog code generation.
摘要
使用Verilog等硬件描述语言(HDL)进行编码是一项耗时且繁琐的任务。随着大语言模型(LLM)的快速发展,人们越来越关注如何应用LLM辅助HDL编码。近期研究表明,LLM在将自然语言转换为传统HDL Verilog方面具有潜力。Chisel作为基于Scala的下一代HDL,引入了更高层次的抽象,有助于实现更简洁、可维护和可扩展的硬件设计。然而,利用LLM生成Chisel代码的潜力尚未得到充分探索。本研究提出ReChisel,一个基于LLM的代理系统,旨在提升Chisel代码生成的效率。ReChisel通过集成反射机制,利用编译和仿真过程的反馈迭代优化生成代码质量,并引入逃逸机制以跳出非进展循环。实验表明,ReChisel显著提高了Chisel代码生成的成功率,其性能可与最先进的基于LLM的Verilog代码生成代理系统相媲美。
Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models
Abstract
arXiv:2505.19690v1 Announce Type: new Abstract: Despite the remarkable proficiency of Large Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. Existing evaluations primarily assess response-level safety, neglecting a critical issue we identify as Superficial Safety Alignment (SSA) -- a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, resulting in inconsistent safety behaviors across multiple sampling attempts. To systematically investigate SSA, we introduce Beyond Safe Answers (BSA) bench, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types and spanning nine risk categories, each meticulously annotated with risk rationales. Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales. We further explore the efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse decoding strategies in mitigating SSA. Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
摘要
尽管大型推理模型(LRMs)在处理复杂推理任务方面表现出卓越能力,但其在安全关键场景中的可靠性仍存在不确定性。现有评估主要关注响应层面的安全性,却忽视了我们发现的关键问题——表面安全对齐(SSA)。该现象表现为模型生成表面安全的输出,而其内部推理过程未能真正识别和缓解潜在风险,导致多次采样尝试中出现不一致的安全行为。为系统研究SSA,我们提出超越安全答案(BSA)基准,该新型基准包含2,000个挑战性实例,分为三种SSA场景类型,涵盖九大风险类别,每个实例均经过风险原理的精细标注。对19个最先进LRMs的评估表明该基准具有较高难度,表现最佳的模型在正确识别风险原理方面仅达到38.0%准确率。我们进一步探究了安全规则、针对安全推理数据的专项微调以及多样化解码策略在缓解SSA方面的有效性。本研究为评估和提升LRMs的安全推理保真度提供了全面评估工具,推动了真正具备风险意识且可靠安全的人工智能系统的发展。
SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
Abstract
arXiv:2505.19641v1 Announce Type: new Abstract: Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.
摘要
OpenAI-o1和DeepSeek R1等最新进展证明了强化学习(RL)在增强大语言模型(LLMs)推理能力方面的潜力。尽管开源复现工作主要集中在数学和编程领域,但开发通用推理能力的方法和资源仍未被充分探索。这一空白部分源于难以收集适合RL训练的多样化且可验证的推理数据。我们假设逻辑推理是发展通用推理能力的关键,因为逻辑构成推理的基础构建模块。本研究提出SynLogic——一个可规模化生成多样化逻辑推理数据的数据合成框架与数据集,涵盖35类不同的逻辑推理任务。SynLogic方法能按需调节数据难度与数量进行可控合成。重要的是,所有示例均可通过简单规则验证,使其特别适合搭配可验证奖励机制的RL训练。实验基于7B和32B模型验证了SynLogic数据集上RL训练的有效性:在开源数据集中,SynLogic实现了最先进的逻辑推理性能,在BBEH基准上以6分优势超越DeepSeek-R1-Distill-Qwen-32B。此外,将SynLogic数据与数学及编程任务混合训练,能提升这些领域的训练效率并显著增强推理泛化能力。值得注意的是,我们的混合训练模型在多个基准测试中全面优于DeepSeek-R1-Zero-Qwen-32B。这些发现使SynLogic成为推进LLMs广义推理能力的重要资源。我们已在https://github.com/MiniMax-AI/SynLogic开源数据合成管道与SynLogic数据集。
Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning
Abstract
arXiv:2505.19761v1 Announce Type: new Abstract: While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework GLIDER (Grounding Language Models as EffIcient Decision-Making Agents via Offline HiErarchical Reinforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
摘要
尽管大型语言模型(LLMs)展现出复杂的推理能力,但由于探索不足和长期信用分配问题,其在长时程决策任务中仍存在困难,尤其在稀疏奖励场景下。受分治原则启发,我们提出创新框架GLIDER(Grounding Language Models as EffIcient Decision-Making Agents via Offline HiErarchical Reinforcement Learning),该框架为LLM策略引入了一种参数高效且普遍适用的层次结构。我们设计了一种方案,其中低级控制器通过高级策略学习并指导的抽象分步计划进行监督。该设计将复杂问题分解为一系列连贯的思维链推理子任务,通过灵活的时间抽象显著增强长时程任务的探索与学习能力。此外,得益于其任务无关低级技能的强可迁移性,GLIDER能够快速在线适应非平稳环境。在ScienceWorld和ALFWorld基准测试上的实验表明,GLIDER实现了持续的性能提升,并展现出更强的泛化能力。
Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting
Abstract
arXiv:2505.19716v1 Announce Type: new Abstract: Existing chain-of-thought (CoT) distillation methods can effectively transfer reasoning abilities to base models but suffer from two major limitations: excessive verbosity of reasoning traces and inadequate adaptability to problem difficulty. Long reasoning traces significantly increase inference costs, and uniform-length solutions prevent base models from learning adaptive reasoning strategies. To address these issues, we propose a difficulty-aware prompting (DAP) method to dynamically shorten reasoning traces without performance loss. In our approach, a large teacher model first judges each problem's difficulty and then rewrites its reasoning traces to an appropriate shorter length, yielding concise yet complete reasoning traces. Leveraging the DAP pipeline, we curate a distilled dataset called LiteCoT consisting of 100K concise reasoning examples, with solutions averaging only 720 tokens (an order of magnitude shorter than typical CoTs). Using LiteCoT, we distilled a new family of reasoning models called Liter (1.5B, 7B, and 32B) based on the Qwen2.5 architecture. Experiments show that a student model fine-tuned on just 100K of these difficulty-pruned CoT samples outperforms a model distilled on 800K original Long CoT samples, while significantly reducing training and inference costs. Our method also generalizes well: across 11 diverse benchmarks, the shorter difficulty-aware CoTs achieve equal or better accuracy than Long chains, using far fewer tokens. For example, on the challenging AIME24 exam, our approach reaches 74.2% Pass@1 using only about 5K inference tokens, surpassing other methods that consume many more tokens. Our code and data are available at https://github.com/Evanwu1125/LiteCoT.
摘要
现有思维链(CoT)蒸馏方法能有效将推理能力迁移至基础模型,但存在两大局限:推理轨迹过于冗长及对问题难度适应性不足。冗长的推理轨迹显著增加推理成本,而统一长度的解决方案阻碍基础模型学习自适应推理策略。为解决这些问题,我们提出难度感知提示(DAP)方法,在不损失性能的前提下动态缩短推理轨迹。该方法首先由大型教师模型判断问题难度,随后将其推理轨迹改写为适当缩短的长度,从而生成简洁完整的推理轨迹。基于DAP流程,我们构建了包含10万条精简推理样本的LiteCoT蒸馏数据集,其解决方案平均仅720个token(比典型CoT缩短一个数量级)。使用LiteCoT数据集,我们在Qwen2.5架构上蒸馏出新型推理模型系列Liter(1.5B/7B/32B)。实验表明,仅用10万条经难度筛选的CoT样本微调的学生模型,其性能优于基于80万条原始长CoT样本蒸馏的模型,同时显著降低训练和推理成本。该方法泛化性良好:在11个多样化基准测试中,较短的难度感知CoT使用更少token即可达到与长链相同或更高的准确率。例如在AIME24高难度考试中,我们的方法仅消耗约5K推理token即达到74.2%的Pass@1,优于其他消耗更多token的方法。代码与数据详见https://github.com/Evanwu1125/LiteCoT。
FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLMs on Financial Datasets
Abstract
arXiv:2505.19819v1 Announce Type: new Abstract: Low-rank adaptation (LoRA) methods show great potential for scaling pre-trained general-purpose Large Language Models (LLMs) to hundreds or thousands of use scenarios. However, their efficacy in high-stakes domains like finance is rarely explored, e.g., passing CFA exams and analyzing SEC filings. In this paper, we present the open-source FinLoRA project that benchmarks LoRA methods on both general and highly professional financial tasks. First, we curated 19 datasets covering diverse financial applications; in particular, we created four novel XBRL analysis datasets based on 150 SEC filings. Second, we evaluated five LoRA methods and five base LLMs. Finally, we provide extensive experimental results in terms of accuracy, F1, and BERTScore and report computational cost in terms of time and GPU memory during fine-tuning and inference stages. We find that LoRA methods achieved substantial performance gains of 36% on average over base models. Our FinLoRA project provides an affordable and scalable approach to democratize financial intelligence to the general public. Datasets, LoRA adapters, code, and documentation are available at https://github.com/Open-Finance-Lab/FinLoRA
摘要
低秩自适应(LoRA)方法在将预训练的通用大语言模型(LLM)扩展至数百甚至数千种应用场景方面展现出巨大潜力。然而,其在金融等高风险领域的有效性鲜少被探索,例如通过CFA考试和分析美国证券交易委员会(SEC)文件。本文提出开源项目FinLoRA,对LoRA方法在通用及高度专业化金融任务上的表现进行基准测试。首先,我们整理了涵盖多样化金融应用的19个数据集;特别地,基于150份SEC文件创建了四个新颖的XBRL分析数据集。其次,我们评估了五种LoRA方法和五种基础LLM。最后,我们从准确率、F1值和BERTScore等维度提供了大量实验结果,并报告了微调与推理阶段的时间及GPU内存计算成本。研究发现,LoRA方法相较基础模型平均实现了36%的性能提升。FinLoRA项目为公众提供了一种经济、可扩展的金融智能化普及方案。数据集、LoRA适配器、代码及文档详见https://github.com/Open-Finance-Lab/FinLoRA。
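As background for the benchmark above, the sketch below shows the basic low-rank update that LoRA methods share: a frozen weight matrix W0 of shape d_out x d_in is adapted by adding (alpha / r) * B A, where B is d_out x r and A is r x d_in, so only r * (d_out + d_in) parameters are trained. This is plain LoRA in pure Python for illustration; it is not code from the FinLoRA repository and does not cover the specific variants the paper evaluates. Function names and the example dimensions are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def full_params(d_out: int, d_in: int) -> int:
    """Number of weights updated by full fine-tuning of one dense layer."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of weights in the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def lora_delta(B: Matrix, A: Matrix, alpha: float, rank: int) -> Matrix:
    """Compute the weight update (alpha / rank) * B @ A with plain lists."""
    if any(len(row) != rank for row in B) or len(A) != rank:
        raise ValueError("B must be d_out x rank and A must be rank x d_in")
    scale = alpha / rank
    d_in = len(A[0])
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(d_in)]
        for i in range(len(B))
    ]


# Illustrative dimensions only (not taken from the paper).
dense = full_params(4096, 4096)               # 16,777,216 weights
low_rank = lora_trainable_params(4096, 4096, 8)  # 65,536 weights
```

For this hypothetical 4096 x 4096 layer at rank 8, the adapter trains well under 1% of the weights that full fine-tuning would touch, which is the source of the time and GPU-memory savings that benchmarks of this kind measure.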
Done Is Better than Perfect: Unlocking Efficient Reasoning by Structured Multi-Turn Decomposition
Abstract
arXiv:2505.19788v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) are criticized for the excessively lengthy Chain-of-Thought (CoT) they use to derive the final answer, which causes high first-token and overall latency. Typically, the CoT of LRMs mixes multiple thinking units; each unit attempts to produce a candidate answer to the original query. Hence, a natural idea to improve efficiency is to reduce the number of units. Yet the thinking units in vanilla CoT cannot be explicitly managed, which makes doing so challenging. This paper introduces Multi-Turn Decomposition (MinD) to decode conventional CoT into a sequence of explicit, structured, and turn-wise interactions to bridge the gap. In MinD, the model provides a multi-turn response to the query, where each turn embraces a thinking unit and yields a corresponding answer. The subsequent turns can reflect, verify, revise, or explore alternative approaches to both the thinking and answer parts of earlier ones. This not only delivers the answer more swiftly, but also enables explicit control over the iterative reasoning process (i.e., users may halt or continue at any turn). We follow a supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm to realize MinD. We first rephrase the outputs of an LRM into multi-turn formats by prompting another LLM, and then tune the LRM with such data. Observing that the tuned model tends to consume even more tokens than the original one (probably because the multi-turn formats introduce additional answer tokens), we advocate leveraging RL algorithms like GRPO to prioritize correct outputs with fewer turns. Trained on the MATH dataset using R1-Distill models, MinD can achieve up to ~70% reduction in both output token usage and time to first token (TTFT), while maintaining competitive performance on reasoning benchmarks such as MATH-500, AIME24, AMC23, and GPQA-Diamond.
摘要
大型推理模型(LRMs)因生成最终答案时需要过长的思维链(CoT)而受到批评,存在首词延迟和总体延迟过高的问题。通常,LRMs的CoT混合了多个思维单元,每个单元试图为原始查询生成一个候选答案。因此,提高效率的自然思路是减少思维单元数量。然而,传统CoT中的思维单元无法显式管理,使得这一目标难以实现。本文提出多轮分解(MinD)方法,将传统CoT解码为一系列显式、结构化、轮次化的交互以弥合这一差距。在MinD中,模型对查询提供多轮响应,每轮包含一个思维单元并生成相应答案。后续轮次可对先前轮次的思维部分和答案部分进行反思、验证、修正或探索替代方案。这不仅使答案更快呈现,还能实现对迭代推理过程的显式控制(用户可在任意轮次停止或继续)。我们采用监督微调(SFT)结合强化学习(RL)的范式实现MinD:首先通过提示另一个LLM将LRM的输出重述为多轮格式,随后用此类数据微调LRM。发现微调后的模型倾向于消耗比原始模型更多的token(可能因多轮格式引入了额外答案token),我们主张采用GRPO等RL算法优先选择轮次更少的正确输出。在MATH数据集上使用R1-Distill模型训练的MinD,能在保持MATH-500、AIME24、AMC23和GPQA-Diamond等推理基准竞争力的同时,实现输出token使用量和首词时间(TTFT)最高约70%的降低。
DGRAG: Distributed Graph-based Retrieval-Augmented Generation in Edge-Cloud Systems
Abstract
arXiv:2505.19847v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the capabilities of language models by integrating external knowledge. Due to the diversity of data sources and the constraints of memory and computing resources, real-world data is often scattered in multiple devices. Conventional RAGs that store massive amounts of scattered data centrally face increasing privacy concerns and high computational costs. Additionally, RAG in a central node raises latency issues when searching over a large-scale knowledge base. To address these challenges, we propose a distributed Knowledge Graph-based RAG approach, referred to as DGRAG, in an edge-cloud system, where each edge device maintains a local knowledge base without the need to share it with the cloud, instead sharing only summaries of its knowledge. Specifically, DGRAG has two main phases. In the Distributed Knowledge Construction phase, DGRAG organizes local knowledge using knowledge graphs, generating subgraph summaries and storing them in a summary database in the cloud as information sharing. In the Collaborative Retrieval and Generation phase, DGRAG first performs knowledge retrieval and answer generation locally, and a gate mechanism determines whether the query is beyond the scope of local knowledge or processing capabilities. For queries that exceed the local knowledge scope, the cloud retrieves knowledge from the most relevant edges based on the summaries and generates a more precise answer. Experimental results demonstrate the effectiveness of the proposed DGRAG approach in significantly improving the quality of question-answering tasks over baseline approaches.
摘要
检索增强生成(RAG)作为一种通过整合外部知识来增强语言模型能力的方法,已展现出广阔前景。由于数据源的多样性与内存、计算资源的限制,现实世界中的数据往往分散存储于多个设备中。传统RAG方案集中存储海量分散数据,不仅面临日益严峻的隐私问题,还伴随高昂的计算成本。此外,在中央节点实施RAG时,大规模知识库检索会引发延迟问题。为应对这些挑战,我们提出一种基于分布式知识图的RAG方法(简称DGRAG),部署于边缘-云系统中。该方法中,每个边缘设备维护本地知识库而无需共享原始数据,仅通过知识摘要实现信息交互。具体而言,DGRAG包含两个核心阶段:在分布式知识构建阶段,系统利用知识图谱组织本地知识,生成子图摘要并存储于云端的摘要数据库以实现信息共享;在协同检索与生成阶段,DGRAG首先在本地执行知识检索与答案生成,并通过门控机制判断查询是否超出本地知识范围或处理能力。对于超出本地知识范围的查询,云端将根据摘要从最相关的边缘设备检索知识,并生成更精确的答案。实验结果表明,相较于基线方法,所提出的DGRAG方案能显著提升问答任务的质量。
HS-STAR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation
Abstract
arXiv:2505.19866v1 Announce Type: new Abstract: Self-taught reasoners (STaRs) enhance the mathematical reasoning abilities of large language models (LLMs) by leveraging self-generated responses for self-training. Recent studies have incorporated reward models to guide response selection or decoding, aiming to obtain higher-quality data. However, they typically allocate a uniform sampling budget across all problems, overlooking the varying utility of problems at different difficulty levels. In this work, we conduct an empirical study and find that problems near the boundary of the LLM's reasoning capability offer significantly greater learning utility than both easy and overly difficult ones. To identify and exploit such problems, we propose HS-STaR, a Hierarchical Sampling framework for Self-Taught Reasoners. Given a fixed sampling budget, HS-STaR first performs lightweight pre-sampling with a reward-guided difficulty estimation strategy to efficiently identify boundary-level problems. Subsequently, it dynamically reallocates the remaining budget toward these high-utility problems during a re-sampling phase, maximizing the generation of valuable training data. Extensive experiments across multiple reasoning benchmarks and backbone LLMs demonstrate that HS-STaR significantly outperforms other baselines without requiring additional sampling budget.
摘要
自学推理器(STaRs)通过利用自生成的响应进行自我训练,增强了大型语言模型(LLMs)的数学推理能力。近期研究引入奖励模型以指导响应选择或解码,旨在获取更高质量的数据。然而,这些方法通常对所有问题分配统一的采样预算,忽视了不同难度问题在效用上的差异。本研究通过实证分析发现,位于模型推理能力边界附近的问题,其学习效用显著高于简单或过度困难的问题。为识别并利用此类问题,我们提出HS-STaR框架——一种面向自学推理器的分层采样方法。在固定采样预算下,HS-STaR首先采用基于奖励的难度评估策略进行轻量级预采样,高效定位边界级问题;随后在重采样阶段动态将剩余预算重新分配给这些高效用问题,从而最大化有价值训练数据的生成。跨多个推理基准和骨干LLMs的广泛实验表明,HS-STaR在不增加采样预算的前提下,显著优于其他基线方法。
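The two-phase budget allocation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `reallocate_budget`, the use of the mean reward as the difficulty estimate, the `low`/`high` thresholds, and the even split among boundary problems are all hypothetical choices that the abstract does not specify.

```python
from typing import Dict, List


def reallocate_budget(
    pre_rewards: Dict[str, List[float]],
    total_budget: int,
    low: float = 0.2,
    high: float = 0.8,
) -> Dict[str, int]:
    """Return the number of extra samples per problem after pre-sampling.

    pre_rewards maps a problem id to the (non-empty) list of reward-model
    scores in [0, 1] for its pre-sampled responses. low and high are
    hypothetical thresholds that delimit "boundary-level" problems.
    """
    # Budget already consumed by the lightweight pre-sampling phase.
    spent = sum(len(scores) for scores in pre_rewards.values())
    remaining = max(total_budget - spent, 0)

    # Assumed difficulty estimate: mean reward of the pre-samples.
    mean_reward = {pid: sum(s) / len(s) for pid, s in pre_rewards.items()}

    # Boundary problems are neither reliably solved nor hopeless.
    boundary = sorted(pid for pid, m in mean_reward.items() if low <= m <= high)

    extra = {pid: 0 for pid in pre_rewards}
    if not boundary:
        return extra

    # Split the remaining budget evenly; leftovers go to the first ids.
    share, leftover = divmod(remaining, len(boundary))
    for i, pid in enumerate(boundary):
        extra[pid] = share + (1 if i < leftover else 0)
    return extra
```

Calling `reallocate_budget` with four pre-samples per problem and a total budget larger than the pre-sampling cost sends every remaining sample to the problems whose mean reward falls between the two thresholds, while easy and overly difficult problems receive none.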
TCP: a Benchmark for Temporal Constraint-Based Planning
Abstract
arXiv:2505.19927v1 Announce Type: new Abstract: Temporal reasoning and planning are essential capabilities for large language models (LLMs), yet most existing benchmarks evaluate them in isolation and under limited forms of complexity. To address this gap, we introduce the Temporal Constraint-based Planning (TCP) benchmark, which jointly assesses both capabilities. Each instance in TCP features a naturalistic dialogue around a collaborative project, where diverse and interdependent temporal constraints are explicitly or implicitly expressed, and models must infer an optimal schedule that satisfies all constraints. To construct TCP, we first generate abstract problem prototypes that are paired with realistic scenarios from various domains and enriched into dialogues using an LLM. A human quality check is performed on a sampled subset to confirm the reliability of our benchmark. We evaluate state-of-the-art LLMs and find that even the strongest models struggle with TCP, highlighting its difficulty and revealing limitations in LLMs' temporal constraint-based planning abilities. We analyze underlying failure cases, open-source our benchmark, and hope our findings can inspire future research.
摘要
时间推理与规划是大语言模型(LLMs)的核心能力,但现有基准测试大多孤立评估这两项能力且复杂度有限。为弥补这一不足,我们提出基于时间约束的规划(TCP)基准,该基准可联合评估上述双重能力。TCP每个实例围绕协作项目构建自然对话,其中显性或隐式包含多样且相互依赖的时间约束,模型必须推断出满足所有约束的最优时间表。TCP的构建首先生成抽象问题原型,将其与多领域现实场景配对,并利用LLM扩展为对话。通过对抽样子集的人工质检,我们验证了基准的可靠性。评估表明,即使最先进的LLMs在TCP上也表现不佳,凸显其难度并揭示LLMs在基于时间约束的规划能力上的局限。我们分析了典型错误案例,开源了基准测试集,期望研究成果能推动未来探索。
Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program
Abstract
arXiv:2505.19896v1 Announce Type: new Abstract: Recent trends are emerging in the use of Large Language Models (LLMs) as autonomous agents that take actions based on the content of the user text prompts. We intend to apply these concepts to the field of space control, enabling LLMs to play a significant role in the decision-making process for autonomous satellite operations. As a first step towards this goal, we have developed a pure LLM-based solution for the Kerbal Space Program Differential Games (KSPDG) challenge, a public software design competition where participants create autonomous agents for maneuvering satellites involved in non-cooperative space operations, running on the KSP game engine. Our approach leverages prompt engineering, few-shot prompting, and fine-tuning techniques to create an effective LLM-based agent that ranked 2nd in the competition. To the best of our knowledge, this work pioneers the integration of LLM agents into space research. The project comprises several open repositories to facilitate replication and further research. The codebase is accessible on GitHub (https://github.com/ARCLab-MIT/kspdg), while the trained models and datasets are available on Hugging Face (https://huggingface.co/OhhTuRnz). Additionally, experiment tracking and detailed results can be reviewed on Weights & Biases (https://wandb.ai/carrusk/huggingface).
摘要
当前出现了一种新趋势,即利用大型语言模型(LLMs)作为自主代理,根据用户文本提示的内容采取行动。我们计划将这些概念应用于空间控制领域,使LLMs在自主卫星操作的决策过程中发挥重要作用。作为实现该目标的第一步,我们为Kerbal太空计划差分博弈(KSPDG)挑战开发了一个纯基于LLM的解决方案。KSPDG是一项公开的软件设计竞赛,参赛者需创建自主代理,用于在KSP游戏引擎上操控参与非合作空间操作的卫星。我们的方法结合了提示工程、少样本提示和微调技术,开发出一个高效的基于LLM的代理,并在竞赛中获得第二名。据我们所知,这项工作首次将LLM代理集成到空间研究中。该项目包含多个开放仓库,以便于复现和进一步研究。代码库可在GitHub上获取,而训练好的模型和数据集则发布于Hugging Face。此外,实验跟踪和详细结果可在Weights & Biases上查看。
EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM
Abstract
arXiv:2505.19905v1 Announce Type: new Abstract: Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.
摘要
尽管大型语言模型(LLM)在多项基于文本的推理与规划任务中展现出卓越能力,但其在机器人控制领域的应用仍存在显著局限:(1)现有LLM智能体主要设计用于处理文本输入而非视觉条件;(2)当前多模态智能体将LLM视为静态规划器,使其推理过程与环境动态分离,导致动作决策缺乏领域特异性知识;(3)LLM不具备从视觉交互中学习的能力,难以针对特定领域优化策略。本文提出EMAC+——一种通过双向训练范式协同整合LLM与视觉语言模型(VLM)的具身多模态智能体。与现有方法不同,EMAC+利用执行底层视觉控制任务的VLM实时反馈,动态优化LLM生成的高级文本规划方案。我们通过让LLM直接内化交互体验中的视觉环境动态(而非依赖静态符号映射),解决了先前模型的关键缺陷。在ALFWorld和RT-1基准测试中的大量实验表明,EMAC+在任务性能、噪声观测鲁棒性及学习效率方面均表现优异。同时,我们开展了系统的消融研究,并对成功与失败案例进行了详细分析。
DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph
Abstract
arXiv:2505.19956v1 Announce Type: new Abstract: Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationship between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. Our code will be released.
摘要
文本到SQL(Text-to-SQL)任务旨在将自然语言问题转换为SQL查询,随着大型语言模型(LLMs)的上下文学习能力提升而取得进展。然而,现有方法相比随机选择的示例在性能上改进有限,且当使用较小规模的LLMs(如Llama 3.1-8B)时会出现显著性能下降。这表明这些方法过度依赖超大规模LLMs的固有能力,而非有效检索有用的示例。本文提出一种新颖的方法,用于高效检索示例并生成SQL查询。我们构建了一种深度上下文模式链接图(Deep Contextual Schema Link Graph),其中包含问题与其数据库模式项之间的关键信息和语义关系。这种基于图的结构能够有效表示Text-to-SQL样本,并为上下文学习检索有用的示例。在Spider基准测试上的实验结果表明,我们的方法在超大规模LLMs和小规模LLMs上均能持续提升SQL生成的性能和效率。代码将公开发布。
Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making
Abstract
arXiv:2505.19933v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly used for decision making in embodied agents, yet existing safety evaluations often rely on coarse success rates and domain-specific setups, making it difficult to diagnose why and where these models fail. This obscures our understanding of embodied safety and limits the selective deployment of LLMs in high-risk physical environments. We introduce SAFEL, a framework for systematically evaluating the physical safety of LLMs in embodied decision making. SAFEL assesses two key competencies: (1) rejecting unsafe commands via the Command Refusal Test, and (2) generating safe and executable plans via the Plan Safety Test. Critically, the latter is decomposed into functional modules (goal interpretation, transition modeling, and action sequencing), enabling fine-grained diagnosis of safety failures. To support this framework, we introduce EMBODYGUARD, a PDDL-grounded benchmark containing 942 LLM-generated scenarios covering both overtly malicious and contextually hazardous instructions. Evaluation across 13 state-of-the-art LLMs reveals that while models often reject clearly unsafe commands, they struggle to anticipate and mitigate subtle, situational risks. Our results highlight critical limitations in current LLMs and provide a foundation for more targeted, modular improvements in safe embodied reasoning.
摘要
大型语言模型(LLMs)在具身智能体的决策中应用日益广泛,然而现有安全评估通常依赖粗粒度的成功率指标和特定领域设置,难以诊断模型失败的具体原因及环节。这种模糊性阻碍了对具身安全性的深入理解,也限制了LLMs在高风险物理环境中的选择性部署。我们提出SAFEL框架,用于系统评估LLMs在具身决策中的物理安全性。SAFEL评估两大核心能力:(1)通过指令拒绝测试(Command Refusal Test)识别并拒绝不安全指令;(2)通过计划安全测试(Plan Safety Test)生成安全且可执行的方案。关键创新在于将后者分解为功能模块——目标解析、状态转移建模、动作序列生成,从而实现安全失效的细粒度归因。为支持该框架,我们构建了EMBODYGUARD基准测试,基于PDDL语言开发,包含942个LLM生成场景,涵盖显性恶意指令和情境性危险指令。对13个前沿LLMs的评估表明:虽然模型常能拒绝明显不安全的指令,但对潜在情境风险的预判与规避能力仍显不足。研究结果揭示了当前LLMs的关键局限,为具身推理安全性的模块化定向改进提供了理论基础。
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Abstract
arXiv:2505.19897v1 Announce Type: new Abstract: Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.
摘要
大语言模型(LLMs)的影响已超越自然语言处理领域,显著推动了跨学科研究的发展。近期,各类基于LLM的智能体被开发用于辅助科学发现进程,覆盖多领域多环节。其中,能够像人类一样与操作系统交互的计算机使用型智能体,正在为自动化解决科学问题及处理研究人员工作流程中的常规任务开辟道路。认识到这些智能体的变革潜力,我们推出ScienceBoard,其包含两项互补性贡献:(1)一个真实、多领域的动态可视化科学工作流环境,集成专业软件,使智能体能够通过不同界面自主交互,以加速复杂科研任务与实验;(2)一个由人类精心策划的169项高质量、严格验证的现实任务基准,涵盖生物化学、天文学、地理信息学等领域的科学发现工作流程。对采用最先进架构(如GPT-4o、Claude 3.7、UI-TARS等)的智能体进行的广泛评估表明,尽管取得部分积极成果,它们仍难以可靠辅助科学家完成复杂工作流,整体成功率仅为15%。深度分析进一步为解决当前智能体局限性及设计更有效原则提供了宝贵见解,为构建更具科学发现能力的智能体铺平道路。我们的代码、环境与基准详见https://qiushisun.github.io/ScienceBoard-Home/。
Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging
Abstract
arXiv:2505.19892v1 Announce Type: new Abstract: While foundation models update slowly due to resource-intensive training requirements, domain-specific models evolve between updates. Model merging aims to combine multiple expert models into a single, more capable model, thereby reducing storage and serving costs while supporting decentralized model development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Multimodal Large Language Models (MLLMs), which extend the capabilities of LLMs through large-scale multimodal training, have gained traction. However, there is no benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. In this paper, (i) we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward the Omni-language model. (ii) We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. (iii) We find that model merging offers a promising way of building improved MLLMs without requiring data training. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.
摘要
由于资源密集的训练需求,基础模型更新缓慢,而领域专用模型在更新间隔期间持续演进。模型融合旨在将多个专家模型合并为单一更强能力的模型,从而降低存储与服务成本,同时支持去中心化的模型开发。尽管潜力巨大,先前研究主要集中于融合视觉分类模型或面向代码与数学任务的大语言模型(LLMs)。通过大规模多模态训练扩展LLM能力的多模态大语言模型(MLLMs)已受到广泛关注,但当前缺乏明确划分MLLM训练与评估任务的模型融合研究基准。本文中:(i)我们提出首个MLLM模型融合基准,涵盖视觉问答、几何、图表、光学字符识别和接地任务,并提供LoRA与全参数微调模型;进一步探索如何通过模型融合整合不同模态(如视觉-语言、音频-语言和视频-语言模型),向全能语言模型迈进。(ii)我们在基准上实现10种融合算法,并提出创新方法:通过消除任务向量噪声并基于任务向量交互定义的损失函数鲁棒优化合并向量,平均性能提升达2.48%。(iii)研究发现模型融合为构建更强MLLMs提供了无需数据训练的新途径,实验证实多模态间的互补性显著优于单一模态。
Adaptive Location Hierarchy Learning for Long-Tailed Mobility Prediction
Abstract
arXiv:2505.19965v1 Announce Type: new Abstract: Human mobility prediction, which aims to forecast users' next location visits based on historical trajectories, is crucial for applications ranging from location-based recommendations to urban planning. Despite the severe long-tailed distribution of locations, the problem of long-tailed mobility prediction remains largely underexplored. Existing long-tailed learning methods primarily focus on rebalancing the skewed distribution at the data, model, or class level, neglecting to exploit the spatiotemporal semantics of locations. To address this gap, we propose the first plug-and-play framework for long-tailed mobility prediction in an exploitation and exploration manner, named Adaptive LOcation HierArchy learning (ALOHA). First, we construct a city-tailored location hierarchy based on Large Language Models (LLMs) by exploiting Maslow's theory of human motivation to design Chain-of-Thought (CoT) prompts that capture spatiotemporal semantics. Second, we optimize the location hierarchy predictions by Gumbel disturbance and node-wise adaptive weights within the hierarchical tree structure. Experiments on state-of-the-art models across six datasets demonstrate the framework's consistent effectiveness and generalizability, striking a good balance between head and tail locations. Weight analysis and ablation studies reveal the optimization differences of each component for head and tail locations. Furthermore, in-depth analyses of hierarchical distance and a case study demonstrate the effective semantic guidance from the location hierarchy. Our code will be made publicly available.
摘要
人类移动预测对于从基于位置的推荐到城市规划等应用至关重要,其目标是根据历史轨迹预测用户的下一个访问位置。尽管位置数据存在严重的长尾分布,但长尾移动预测问题仍未得到充分探索。现有长尾学习方法主要关注在数据、模型或类别层面重新平衡偏态分布,而忽视了挖掘位置的时空语义。为填补这一空白,我们提出了首个即插即用的长尾移动预测框架ALOHA(自适应位置层次学习),采用开发与探索相结合的策略。首先,基于大语言模型构建城市定制化位置层次结构,利用马斯洛人类动机理论设计思维链提示以捕捉时空语义。其次,通过Gumbel扰动和层次树结构内节点自适应权重优化位置层级预测。在六个数据集上的最新模型实验表明,该框架具有持续有效性和泛化能力,在头部与尾部位置间实现了良好平衡。权重分析和消融研究揭示了各组件对头部与尾部位置的差异化优化效果。此外,层级距离的深入分析和案例研究验证了位置层次结构的有效语义引导作用。我们的代码将公开提供。
Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback
Abstract
arXiv:2505.20075v1 Announce Type: new Abstract: Reward models trained with conventional Reinforcement Learning from AI Feedback (RLAIF) methods suffer from limited generalizability, which hinders the alignment performance of the policy model during reinforcement learning (RL). This challenge stems from various issues, including distribution shift, preference label noise, and mismatches between overly challenging samples and model capacity. In this paper, we attempt to enhance the generalizability of reward models through a data-centric approach, driven by the insight that these issues are inherently intertwined from the perspective of data difficulty. To address this, we propose a novel framework, Curriculum-RLAIF, which constructs preference pairs with varying difficulty levels and produces a curriculum that progressively incorporates preference pairs of increasing difficulty for reward model training. Our experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability, increasing the alignment performance of the policy model by a large margin without incurring additional inference costs compared to various non-curriculum baselines. Detailed analysis and comparisons with alternative approaches, including data selection via external pretrained reward models or internal self-selection mechanisms, as well as other curriculum strategies, further demonstrate the superiority of our approach in terms of simplicity, efficiency, and effectiveness.
摘要
采用传统人工智能反馈强化学习(RLAIF)方法训练的奖励模型存在泛化能力有限的问题,这制约了强化学习(RL)过程中策略模型的对齐性能。该挑战源于多种因素,包括分布偏移、偏好标签噪声,以及高难度样本与模型能力之间的不匹配。本文尝试通过数据驱动的方法提升奖励模型的泛化能力,其核心洞见在于:从数据难度视角看,这些问题本质上是相互关联的。为此,我们提出新型框架Curriculum-RLAIF,该框架通过构建不同难度级别的偏好对,并设计渐进式融入递增难度偏好对的课程方案来训练奖励模型。实验结果表明,相较于多种非课程基线方法,采用Curriculum-RLAIF训练的奖励模型显著提升了泛化能力,在不增加额外推理成本的前提下大幅提高了策略模型的对齐性能。通过与外部预训练奖励模型的数据选择、内部自选择机制等替代方案以及其他课程策略的详细对比分析,进一步验证了本方法在简洁性、效率和有效性方面的优越性。
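The idea of a curriculum that progressively incorporates harder preference pairs can be sketched as below. This is only an illustration under assumptions: the `PreferencePair` type, its `difficulty` field, the equal-sized difficulty buckets, and the cumulative stage construction are hypothetical; the abstract does not say how difficulty is measured or how stages are formed.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    difficulty: float  # hypothetical score; higher means harder to rank


def build_curriculum(
    pairs: Sequence[PreferencePair], num_stages: int
) -> List[List[PreferencePair]]:
    """Return one training set per stage, easiest pairs first.

    Each stage keeps all pairs from earlier stages and adds the next
    difficulty bucket, so the final stage contains every pair.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")

    ordered = sorted(pairs, key=lambda p: p.difficulty)
    stages: List[List[PreferencePair]] = []
    for stage in range(1, num_stages + 1):
        # Ceiling division so the last stage always covers all pairs.
        cutoff = (len(ordered) * stage + num_stages - 1) // num_stages
        stages.append(ordered[:cutoff])
    return stages
```

A reward-model trainer would then iterate over the returned stages in order, training on each before moving to the next.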
Automatic Metadata Extraction for Text-to-SQL
Abstract
arXiv:2505.19988v1 Announce Type: new Abstract: Large Language Models (LLMs) have recently become sophisticated enough to automate many tasks ranging from pattern finding to writing assistance to code generation. In this paper, we examine text-to-SQL generation. We have observed from decades of experience that the most difficult part of query development lies in understanding the database contents. These experiences inform the direction of our research. Text-to-SQL benchmarks such as SPIDER and BIRD contain extensive metadata that is generally not available in practice. Human-generated metadata requires the use of expensive Subject Matter Experts (SMEs), who are often not fully aware of many aspects of their databases. In this paper, we explore techniques for automatic metadata extraction to enable text-to-SQL generation. We explore the use of two standard and one newer metadata extraction techniques: profiling, query log analysis, and SQL-to-text generation using an LLM. We use the BIRD benchmark [JHQY+23] to evaluate the effectiveness of these techniques. BIRD does not provide query logs on their test database, so we prepared a submission that uses profiling alone, and does not use any specially tuned model (we used GPT-4o). From Sept 1 to Sept 23, 2024, and Nov 11 through Nov 23, 2024, we achieved the highest score both with and without using the "oracle" information provided with the question set. We regained the number 1 spot on Mar 11, 2025, and are still at #1 at the time of writing (May 2025).
摘要
大型语言模型(LLMs)近期已发展得足够成熟,能够自动化执行从模式发现、写作辅助到代码生成等多种任务。本文重点研究文本到SQL的生成。根据我们数十年的经验观察,查询开发中最困难的部分在于理解数据库内容。这些经验为我们的研究指明了方向。
诸如SPIDER和Bird等文本到SQL基准测试包含丰富的元数据,但这些元数据在实际应用中通常不可获取。人工生成的元数据需要依赖昂贵领域专家(SMEs),而这些专家往往对其数据库的许多方面并不完全了解。本文探索了自动元数据提取技术以实现文本到SQL生成。
我们研究了两种标准技术和一种新型元数据提取方法的应用:数据画像分析、查询日志分析以及使用LLM进行SQL到文本的生成。采用BIRD基准测试[JHQY+23]评估这些技术的有效性。由于BIRD未提供测试数据库的查询日志,我们提交的方案仅使用数据画像分析,且未采用任何特别调优的模型(使用GPT-4o)。在2024年9月1日至23日及11月11日至23日期间,无论是否使用问题集提供的"oracle"信息,我们都获得了最高分数。我们于2025年3月11日重夺榜首位置,并在本文撰写时(2025年5月)仍保持第一。
Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models
Abstract
arXiv:2505.20087v1 Announce Type: new Abstract: Reasoning-based language models have demonstrated strong performance across various domains, with the most notable gains seen in mathematical and coding tasks. Recent research has shown that reasoning also offers significant benefits for LLM safety and guardrail applications. In this work, we conduct a comprehensive analysis of training reasoning-based guardrail models for content moderation, with an emphasis on generalization to custom safety policies at inference time. Our study focuses on two key dimensions: data efficiency and inference efficiency. On the data front, we find that reasoning-based models exhibit strong sample efficiency, achieving competitive performance with significantly fewer training examples than their non-reasoning counterparts. This unlocks the potential to repurpose the remaining data for mining high-value, difficult samples that further enhance model performance. On the inference side, we evaluate practical trade-offs by introducing reasoning budgets, examining the impact of reasoning length on latency and accuracy, and exploring dual-mode training to allow runtime control over reasoning behavior. Our findings will provide practical insights for researchers and developers to effectively and efficiently train and deploy reasoning-based guardrail models in real-world systems.
摘要
基于推理的语言模型在多个领域展现出卓越性能,其中数学和编程任务的提升尤为显著。近期研究表明,推理机制对大型语言模型的安全防护应用同样具有重要价值。本研究对基于推理的内容审核防护模型进行了全面分析,重点探讨其在推理阶段对自定义安全策略的泛化能力。我们从两个关键维度展开研究:数据效率与推理效率。在数据层面,发现基于推理的模型具有显著的样本效率,仅需远少于非推理模型的训练样本即可达到相当性能,这使得剩余数据可被重新用于挖掘高价值难样本以进一步提升模型表现。在推理层面,我们通过引入推理预算评估实际权衡,考察推理长度对延迟和准确率的影响,并探索双模式训练以实现运行时对推理行为的动态调控。本研究将为开发者在实际系统中高效训练和部署基于推理的防护模型提供实用指导。
Agentic AI Process Observability: Discovering Behavioral Variability
Abstract
arXiv:2505.20127v1 Announce Type: new Abstract: AI agents that leverage Large Language Models (LLMs) are increasingly becoming core building blocks of modern software systems. A wide range of frameworks is now available to support the specification of such applications. These frameworks enable the definition of agent setups using natural language prompting, which specifies the roles, goals, and tools assigned to the various agents involved. Within such setups, agent behavior is non-deterministic for any given input, highlighting the critical need for robust debugging and observability tools. In this work, we explore the use of process and causal discovery applied to agent execution trajectories as a means of enhancing developer observability. This approach aids in monitoring and understanding the emergent variability in agent behavior. Additionally, we complement this with LLM-based static analysis techniques to distinguish between intended and unintended behavioral variability. We argue that such instrumentation is essential for giving developers greater control over evolving specifications and for identifying aspects of functionality that may require more precise and explicit definitions.
摘要
基于大语言模型(LLMs)的人工智能代理正日益成为现代软件系统的核心构建模块。目前已有多种框架支持此类应用的规范定义,这些框架通过自然语言提示实现代理配置,明确指定各代理的角色、目标及分配工具。在此类配置中,代理行为对于任何给定输入均呈现非确定性特征,这凸显出强大调试与可观测性工具的关键需求。本研究探索将过程发现与因果发现技术应用于代理执行轨迹,以此增强开发者可观测性。该方法有助于监测和理解代理行为中涌现的变异性。此外,我们结合基于LLM的静态分析技术,以区分预期与非预期的行为变异。我们认为,此类工具对于提升开发者对演进规范的控制力,以及识别需要更精确明确定义的功能维度具有重要作用。
Capability-Based Scaling Laws for LLM Red-Teaming
Abstract
arXiv:2505.20162v1 Announce Type: new Abstract: As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.
摘要
随着大语言模型能力和自主性的提升,通过红队测试识别漏洞对安全部署变得至关重要。然而,当红队测试转变为强弱对抗问题(即目标模型能力超越红队测试者)时,传统的提示工程方法可能失效。为研究这一转变,我们从攻击者与目标之间的能力差距视角重新审视红队测试框架。通过基于LLM的越狱攻击模拟人类红队测试者,我们评估了涵盖不同模型家族、规模和能力水平的500多个攻击者-目标组合,发现三个显著趋势:(i) 能力更强的模型具备更优攻击性;(ii) 当目标能力超越攻击者时,攻击成功率急剧下降;(iii) 攻击成功率与MMLU-Pro基准测试中社会科学板块的高表现呈正相关。基于这些趋势,我们推导出越狱攻击的缩放定律,可根据攻击者-目标能力差距预测固定目标的攻击成功率。这些发现表明:固定能力攻击者(如人类)可能对未来模型失效;日益强大的开源模型会放大现有系统风险;模型提供商必须精确测量并控制模型的劝说与操控能力,以限制其作为攻击者的有效性。
An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation
Abstract
arXiv:2505.20182v1 Announce Type: new Abstract: We study cost-efficient collaboration between strong and weak language models for repository-level code generation, where the weak model handles simpler tasks at lower cost, and the most challenging tasks are delegated to the strong model. While many works propose architectures for this task, few analyze performance relative to cost. We evaluate a broad spectrum of collaboration strategies (context-based, pipeline-based, and dynamic) on GitHub issue resolution. Our most effective collaborative strategy achieves equivalent performance to the strong model while reducing the cost by 40%. Based on our findings, we offer actionable guidelines for choosing collaboration strategies under varying budget and performance constraints. Our results show that strong-weak collaboration substantially boosts the weak model's performance at a fraction of the cost, with pipeline-based and context-based methods being the most efficient. We release the code for our work at https://github.com/shubhamrgandhi/codegen-strong-weak-collab.
摘要
我们研究了强弱语言模型在仓库级代码生成中的成本效益协作机制,其中弱模型以较低成本处理简单任务,而最具挑战性的任务则委托给强模型。尽管已有许多研究提出该任务的架构方案,但鲜有工作系统分析性能与成本的关系。我们在GitHub问题解决场景下评估了多种协作策略:基于上下文的、基于管道的以及动态策略。实验表明,最优协作策略在保持与强模型同等性能的同时可降低40%成本。基于研究发现,我们提出了在不同预算和性能约束下选择协作策略的实用指南。结果表明强弱协作能以极小成本显著提升弱模型性能,其中管道式和基于上下文的方法效率最高。本研究代码已发布于https://github.com/shubhamrgandhi/codegen-strong-weak-collab。
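One simple way to picture strong-weak collaboration is a fallback router: the weak model tries first and the task is escalated only when its output is rejected. The sketch below is an assumption for illustration and is not one of the strategies evaluated in the paper; the `Router` class, the `accept` check, and the cost figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A model is abstracted as a function from a task description to a patch.
Model = Callable[[str], str]


@dataclass
class Router:
    weak: Model
    strong: Model
    accept: Callable[[str, str], bool]  # (task, candidate) -> good enough?
    weak_cost: float = 1.0     # hypothetical cost units per weak call
    strong_cost: float = 10.0  # hypothetical cost units per strong call

    def solve(self, task: str) -> Tuple[str, float]:
        """Return (patch, total cost), escalating only when needed."""
        candidate = self.weak(task)
        cost = self.weak_cost
        if self.accept(task, candidate):
            return candidate, cost
        # The weak attempt was rejected: pay for the strong model as well.
        return self.strong(task), cost + self.strong_cost
```

In practice `accept` could be a test run or a verifier model; the total cost then depends on how often the weak model's output passes that check.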
Program of Equations Thoughts to Solve Algebra Word Problems
Abstract
arXiv:2505.20170v1 Announce Type: new Abstract: Solving algebraic word problems (AWPs) has recently emerged as an important natural language processing task. Large language models (LLMs) have demonstrated powerful mathematical capabilities, and the Chain-of-Thought technique, which guides LLMs through step-by-step reasoning, has yielded impressive results. However, this reasoning ability is limited by the computational weaknesses of LLMs themselves, where calculation errors can accumulate, leading to incorrect final answers. To address this, we propose Program of Equations Thoughts (POET), which transforms the task of generating step-by-step reasoning answers into a two-stage task of predicting equations and generating code, offloading complex computations to a Python interpreter to avoid calculation errors in LLMs. Furthermore, we propose Zero-shot POET, which utilizes a manually designed template to enable LLMs to directly generate Python code for one-step solving. Our method achieves accuracies of 95.3% and 98.0% on the PEN and ALG514 datasets, respectively, setting a new state-of-the-art (SOTA). Zero-shot POET also achieves the SOTA result of 95.5% on the DRAW-1K dataset.
摘要
解决代数应用题(AWP)近年来已成为自然语言处理领域的重要任务。当前,大语言模型(LLM)展现出强大的数学能力,而引导模型逐步推理的思维链技术已取得显著成果。然而,这种推理能力受限于LLM自身的计算缺陷——计算误差会逐步累积并导致最终答案错误。为此,我们提出方程程序思维(POET)方法,将生成逐步推理答案的任务转化为预测方程与生成代码的两阶段任务,将复杂计算卸载至Python解释器以避免LLM的计算错误。此外,我们提出零样本POET,通过人工设计模板使LLM能直接生成一步求解的Python代码。本方法在PEN和ALG514数据集上分别达到95.3%和98.0%的准确率,创造了最新最优(SOTA)结果。零样本POET在DRAW-1K数据集上也实现了95.5%的SOTA性能。
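The core idea above, letting the model predict equations and leaving the arithmetic to a Python interpreter, can be illustrated with a minimal sketch. It assumes the predicted equations have already been parsed into coefficients of a two-variable linear system; the helper name `solve_2x2` is hypothetical and is not taken from the POET paper.

```python
from fractions import Fraction


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 exactly (Cramer's rule).

    Exact rational arithmetic stands in for the interpreter-side computation,
    so no rounding or carry errors can accumulate.
    """
    det = Fraction(a1) * b2 - Fraction(a2) * b1
    if det == 0:
        raise ValueError("system has no unique solution")
    x = (Fraction(c1) * b2 - Fraction(c2) * b1) / det
    y = (Fraction(a1) * c2 - Fraction(a2) * c1) / det
    return x, y


# Word problem: "Two numbers sum to 10 and differ by 4."
# Assumed predicted equations: x + y = 10 and x - y = 4.
x, y = solve_2x2(1, 1, 10, 1, -1, 4)
print(x, y)
```

The model is only responsible for setting up the equations; every numeric step happens in code.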
Temporal Sampling for Forgotten Reasoning in LLMs
Abstract
arXiv:2505.20196v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they previously answered correctly during training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. To address this gap, we introduce Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling, and leads to substantial improvements in reasoning performance, with gains of 4 to 19 points in Pass@k and consistent gains in Majority@k across several benchmarks. We further extend our method to LoRA-adapted models, demonstrating that storing only adapter weights across checkpoints achieves similar benefits with minimal storage cost. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.
摘要
微调大语言模型(LLMs)旨在提升其推理能力,但我们发现了一个反直觉的现象:模型往往会遗忘训练过程中曾正确解决的问题。我们将这种现象称为"时序遗忘",并证明其普遍存在于不同模型规模、微调方法(包括强化学习和监督微调)以及多个推理基准测试中。为应对这一问题,我们提出"时序采样"——一种简单的解码策略,通过从训练轨迹中的多个检查点抽取输出来重建被遗忘的解决方案。该方法无需重新训练或集成模型,即可显著提升推理性能:在Pass@k指标上获得4至19分的提升,并在多个基准测试的Majority@k中实现持续增益。我们进一步将该方法拓展至LoRA适配模型,证明仅存储检查点间的适配器权重即可获得相似效益,且存储成本极低。通过利用训练过程中固有的时序多样性,时序采样提供了一种实用且计算高效的方法来挖掘隐藏的推理能力,并促使我们重新思考如何评估大语言模型。
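A minimal sketch of the decoding idea described above: the sampling budget is spread over several training checkpoints instead of being spent on the final one alone. The round-robin allocation and the modelling of checkpoints as plain callables are assumptions made for illustration; the abstract does not specify the allocation scheme.

```python
from collections import Counter


def temporal_sample(checkpoints, prompt, k):
    """Draw k answers, cycling round-robin over checkpoints (latest first).

    `checkpoints` is ordered from earliest to latest; each element is a
    hypothetical callable that maps a prompt to an answer string.
    """
    ordered = list(reversed(checkpoints))
    return [ordered[i % len(ordered)](prompt) for i in range(k)]


def pass_at_k(answers, reference):
    """True if any of the sampled answers matches the reference."""
    return reference in answers


def majority_at_k(answers):
    """Most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


# Toy trajectory: the earliest checkpoint still solves the problem,
# while the two later checkpoints have "forgotten" it.
checkpoints = [lambda p: "42", lambda p: "41", lambda p: "41"]
answers = temporal_sample(checkpoints, "What is 6 * 7?", k=3)
print(answers, pass_at_k(answers, "42"), majority_at_k(answers))
```

In this toy case, sampling only the final checkpoint would never produce the correct answer, whereas the mixed pool contains it. The majority vote still fails here, which is why Pass@k and Majority@k are reported separately.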
Simulating Macroeconomic Expectations using LLM Agents
Abstract
arXiv:2505.17648v1 Announce Type: cross Abstract: We introduce a novel framework for simulating macroeconomic expectation formation using Large Language Model-Empowered Agents (LLM Agents). By constructing thousands of LLM Agents equipped with modules for personal characteristics, prior expectations, and knowledge, we replicate a survey experiment involving households and experts on inflation and unemployment. Our results show that although the expectations and thoughts generated by LLM Agents are more homogeneous than those of human participants, they still effectively capture key heterogeneity across agents and the underlying drivers of expectation formation. Furthermore, a module-ablation exercise highlights the critical role of prior expectations in simulating such heterogeneity. This approach complements traditional survey methods and offers new insights into AI behavioral science in macroeconomic research.
摘要
我们提出了一种利用大语言模型赋能智能体(LLM Agents)模拟宏观经济预期形成的新框架。通过构建数千个配备个人特征模块、先验预期模块和知识模块的LLM智能体,我们复现了针对家庭和专家关于通胀与失业预期的调查实验。研究结果表明:尽管LLM智能体生成的预期和观点比人类参与者更具同质性,但仍能有效捕捉不同智能体间的关键异质性以及预期形成的深层驱动因素。进一步的模块消融实验突显了先验预期在模拟此类异质性中的核心作用。该方法不仅是对传统调查手段的重要补充,更为宏观经济研究中的人工智能行为科学提供了新的研究视角。
InjectLab: A Tactical Framework for Adversarial Threat Modeling Against Large Language Models
Abstract
arXiv:2505.18156v1 Announce Type: cross Abstract: Large Language Models (LLMs) are changing the way people interact with technology. Tools like ChatGPT and Claude AI are now common in business, research, and everyday life. But with that growth comes new risks, especially prompt-based attacks that exploit how these models process language. InjectLab is a security framework designed to address that problem. This paper introduces InjectLab as a structured, open-source matrix that maps real-world techniques used to manipulate LLMs. The framework is inspired by MITRE ATT&CK and focuses specifically on adversarial behavior at the prompt layer. It includes over 25 techniques organized under six core tactics, covering threats like instruction override, identity swapping, and multi-agent exploitation. Each technique in InjectLab includes detection guidance, mitigation strategies, and YAML-based simulation tests. A Python tool supports easy execution of prompt-based test cases. This paper outlines the framework's structure, compares it to other AI threat taxonomies, and discusses its future direction as a practical, community-driven foundation for securing language models.
摘要
大型语言模型(LLMs)正在改变人们与技术交互的方式。诸如ChatGPT和Claude AI等工具现已广泛应用于商业、研究和日常生活。然而,随着其发展,新的风险也随之而来,尤其是利用这些模型语言处理机制的提示型攻击。InjectLab是一个旨在解决该问题的安全框架。本文介绍InjectLab作为一种结构化、开源矩阵,用于映射现实世界中操纵LLMs的技术。该框架受MITRE ATT&CK启发,特别关注提示层的对抗行为,包含六大核心策略下的25种以上技术,涵盖指令覆盖、身份切换和多智能体利用等威胁。InjectLab中的每种技术均包含检测指南、缓解策略及基于YAML的模拟测试,并配备Python工具以支持便捷执行提示型测试用例。本文概述了该框架的结构,将其与其他AI威胁分类法进行比较,并探讨其作为保护语言模型的实践性、社区驱动基础框架的未来发展方向。
On Path to Multimodal Historical Reasoning: HistBench and HistAgent
Abstract
arXiv:2505.20246v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI's capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems, from factual retrieval based on primary sources, to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Having found that LLMs and other agents perform poorly on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%), and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.
摘要
尽管大语言模型(LLMs)的最新进展在各领域取得了显著成就,但其在人文学科尤其是历史学中的能力仍待深入探索。历史推理对人工智能提出了独特挑战,涉及多模态史料解读、时序推理及跨语言分析。虽然通用智能体在现有基准测试中表现良好,但它们缺乏处理历史材料和问题所需的领域专业知识。为填补这一空白,我们推出了HistBench——一个包含414道高质量问题的全新基准测试,由40余位专家共同设计,旨在评估AI的历史推理能力。这些任务涵盖广泛的历史问题,包括基于原始史实检索、手稿与图像的诠释分析,以及涉及考古学、语言学或文化史等跨学科挑战。此外,该基准数据集涵盖29种古今语言,跨越多个历史时期和世界区域。针对LLMs及其他智能体在HistBench上的较差表现,我们进一步提出HistAgent——一个专为历史研究设计的智能体,配备精心构建的OCR、翻译、档案检索和图像理解工具。基于GPT-4o的HistAgent在HistBench上取得了27.54%的pass@1准确率和36.47%的pass@2准确率,显著优于具备在线搜索功能的LLMs及通用智能体(包括GPT-4o的18.60%、DeepSeek-R1的14.49%以及Open Deep Research-smolagents的20.29% pass@1和25.12% pass@2)。这些结果既揭示了现有LLMs与通用智能体的局限性,也验证了HistAgent在历史推理中的优势。
Model-Distributed Inference for Large Language Models at the Edge
Abstract
arXiv:2505.18164v1 Announce Type: cross Abstract: We introduce Model-Distributed Inference for Large-Language Models (MDI-LLM), a novel framework designed to facilitate the deployment of state-of-the-art large-language models (LLMs) across low-power devices at the edge. This is accomplished by dividing the model into multiple partitions, which are then assigned to different devices/nodes within the network. These nodes exchange intermediate activation vectors via device-to-device links, enabling collaborative computation. To enhance the efficiency of this process, we propose the "recurrent pipeline parallelism" technique, which reduces idle time on each device and facilitates parallel inference during the generation of multiple text sequences. By leveraging the combined computational resources of multiple edge devices, MDI-LLM enables the deployment of LLMs that exceed the memory capacity of individual devices, making it possible to perform inference on low-cost hardware. Furthermore, as the number of participating devices increases, MDI-LLM boosts token generation throughput and reduces memory consumption per device.
摘要
我们提出了一种面向大语言模型的模型分布式推理框架(MDI-LLM),该创新框架旨在促进最先进的大语言模型在边缘低功耗设备上的部署。该框架通过将模型划分为多个分区,并将其分配到网络中的不同设备/节点来实现。这些节点通过设备间链路交换中间激活向量,从而实现协同计算。为提高该过程的效率,我们提出了"循环流水线并行"技术,该技术可减少每个设备的空闲时间,并在生成多个文本序列时实现并行推理。通过利用多个边缘设备的组合计算资源,MDI-LLM能够部署超出单个设备内存容量的大语言模型,使得在低成本硬件上执行推理成为可能。此外,随着参与设备数量的增加,MDI-LLM可提升令牌生成吞吐量并降低每个设备的内存消耗。
syftr: Pareto-Optimal Generative AI
Abstract
arXiv:2505.20266v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) pipelines are central to applying large language models (LLMs) to proprietary or dynamic data. However, building effective RAG flows is complex, requiring careful selection among vector databases, embedding models, text splitters, retrievers, and synthesizing LLMs. The challenge deepens with the rise of agentic paradigms. Modules like verifiers, rewriters, and rerankers, each with intricate hyperparameter dependencies, have to be carefully tuned. Balancing tradeoffs between latency, accuracy, and cost becomes increasingly difficult in performance-sensitive applications. We introduce syftr, a framework that performs efficient multi-objective search over a broad space of agentic and non-agentic RAG configurations. Using Bayesian Optimization, syftr discovers Pareto-optimal flows that jointly optimize task accuracy and cost. A novel early-stopping mechanism further improves efficiency by pruning clearly suboptimal candidates. Across multiple RAG benchmarks, syftr finds flows that are on average approximately 9 times cheaper while preserving most of the accuracy of the most accurate flows on the Pareto frontier. Furthermore, syftr's ability to design and optimize allows integrating new modules, making it even easier and faster to realize high-performing generative AI pipelines.
摘要
检索增强生成(RAG)流程是将大语言模型(LLMs)应用于专有或动态数据的核心。然而,构建高效的RAG流程十分复杂,需要在向量数据库、嵌入模型、文本分割器、检索器和合成LLMs之间进行谨慎选择。随着代理范式的兴起,这一挑战进一步加深。验证器、改写器和重排序器等模块——每个模块都具有复杂的超参数依赖关系——必须仔细调优。在性能敏感的应用中,平衡延迟、准确性和成本之间的权衡变得越来越困难。我们提出了syftr框架,该框架能在广泛的代理和非代理RAG配置空间中进行高效的多目标搜索。通过贝叶斯优化,syftr发现了能同时优化任务准确性和成本的帕累托最优流程。一种新颖的早期停止机制通过剪枝明显次优的候选方案,进一步提高了效率。在多个RAG基准测试中,syftr发现的流程平均比帕累托前沿上最准确的流程便宜约9倍,同时保留了其大部分准确性。此外,syftr的设计和优化能力允许集成新模块,使得实现高性能生成式AI流程更加便捷和快速。
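To make the term "Pareto-optimal flows" concrete, the sketch below filters a set of candidate flows down to those not dominated in accuracy and cost. It illustrates only the definition of the Pareto frontier, not syftr's Bayesian Optimization search or early stopping. The flow names and numbers are made up for the example.

```python
def pareto_front(flows):
    """Return names of flows that no other flow dominates.

    `flows` is a list of (name, accuracy, cost) tuples. A flow is dominated
    if another flow is at least as accurate and at least as cheap, and is
    strictly better on at least one of the two objectives.
    """
    front = []
    for name, acc, cost in flows:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in flows
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical candidates: (name, accuracy, cost per query in arbitrary units).
candidates = [
    ("A", 0.90, 9.0),   # most accurate, expensive
    ("B", 0.85, 1.0),   # slightly less accurate, 9 times cheaper
    ("C", 0.80, 2.0),   # worse than B on both objectives
    ("D", 0.90, 10.0),  # as accurate as A but more expensive
]
print(pareto_front(candidates))
```

Choosing between the surviving flows (here A and B) is then a budget decision rather than a search problem.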
Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution
Abstract
arXiv:2505.20286v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have enabled agents to autonomously perform complex, open-ended tasks. However, many existing frameworks depend heavily on manually predefined tools and workflows, which hinder their adaptability, scalability, and generalization across domains. In this work, we introduce Alita, a generalist agent designed with the principle of "Simplicity is the ultimate sophistication," enabling scalable agentic reasoning through minimal predefinition and maximal self-evolution. For minimal predefinition, Alita is equipped with only one component for direct problem-solving, making it much simpler and neater than previous approaches that relied heavily on hand-crafted, elaborate tools and workflows. This clean design enhances its potential to generalize to challenging questions, without being limited by tools. For maximal self-evolution, we enable the creativity of Alita by providing a suite of general-purpose components to autonomously construct, refine, and reuse external capabilities by generating task-related model context protocols (MCPs) from open source, which contributes to scalable agentic reasoning. Notably, Alita achieves 75.15% pass@1 and 87.27% pass@3 accuracy on the GAIA benchmark validation dataset, which is top-ranking among general-purpose agents, and 74.00% and 52.00% pass@1 on Mathvista and PathVQA, respectively, outperforming many agent systems with far greater complexity. More details will be updated at https://github.com/CharlesQ9/Alita.
摘要
大语言模型(LLM)的最新进展使得智能体能够自主执行复杂的开放式任务。然而,现有框架大多严重依赖手动预定义工具与工作流,这限制了其跨领域的适应性、可扩展性和泛化能力。本研究提出Alita——一个遵循"至简即至臻"原则设计的通用智能体,通过最小化预定义与最大化自我进化实现可扩展的自主推理。在最小化预定义方面,Alita仅配备单一直接问题解决组件,相比依赖大量手工构建复杂工具链的现有方案更为简洁。这种简洁设计增强了其应对挑战性问题的泛化潜力,不受工具限制。在最大化自我进化方面,我们通过开源模型上下文协议(MCP)生成机制,提供通用组件套件使智能体能自主构建、优化和复用外部能力,从而激发创造力并实现可扩展的自主推理。值得注意的是,Alita在GAIA基准验证集上达到75.15% pass@1和87.27% pass@3准确率,位列通用智能体榜首;在Mathvista和PathVQA上分别取得74.00%和52.00% pass@1,性能超越许多复杂度更高的智能体系统。更多细节将持续更新于https://github.com/CharlesQ9/Alita。
LA-RCS: LLM-Agent-Based Robot Control System
Abstract
arXiv:2505.18214v1 Announce Type: cross Abstract: LA-RCS (LLM-agent-based robot control system) is a sophisticated robot control system designed to autonomously plan, work, and analyze the external environment based on user requirements by utilizing an LLM agent. Utilizing a dual-agent framework, LA-RCS generates plans based on user requests, observes the external environment, executes the plans, and modifies the plans as needed to adapt to changes in the external conditions. Additionally, LA-RCS interprets natural language commands from the user and converts them into commands compatible with the robot interface so that the robot can execute tasks and meet user requests properly. During this process, the system autonomously evaluates observation results, provides feedback on the tasks, and executes commands based on real-time environmental monitoring, significantly reducing the need for user intervention in fulfilling requests. We categorized the scenarios that LA-RCS needs to perform into four distinct types and conducted a quantitative assessment of its performance in each scenario. The results showed an average success rate of 90 percent, demonstrating the system's capability to fulfill user requests satisfactorily. For more extensive results, readers can visit our project page: https://la-rcs.github.io
Towards medical AI misalignment: a preliminary study
Abstract
arXiv:2505.18212v1 Announce Type: cross Abstract: Despite their staggering capabilities as assistant tools, often exceeding human performance, Large Language Models (LLMs) are still prone to jailbreak attempts from malevolent users. Although red teaming practices have already identified and helped to address several such jailbreak techniques, one particularly sturdy approach involving role-playing (which we named "Goofy Game") seems effective against most current LLM safeguards. This can result in the provision of unsafe content, which, although not harmful per se, might lead to dangerous consequences if delivered in a setting such as the medical domain. In this preliminary and exploratory study, we provide an initial analysis of how, even without technical knowledge of the internal architecture and parameters of generative AI models, a malicious user could construct a role-playing prompt capable of coercing an LLM into producing incorrect (and potentially harmful) clinical suggestions. We aim to illustrate a specific vulnerability scenario, providing insights that can support future advancements in the field.
摘要
尽管大型语言模型(LLM)作为辅助工具展现出惊人能力且常超越人类表现,但其仍易受到恶意用户的越狱攻击。虽然红队测试已识别并协助修复了多种此类越狱技术,但一种名为"滑稽游戏"的角色扮演方法表现出特殊鲁棒性,能有效突破当前大多数LLM防护机制。这可能导致模型输出不安全内容——尽管内容本身无害,但若应用于医疗等领域则可能引发严重后果。在本探索性初步研究中,我们首次分析了恶意用户如何在无需了解生成式AI模型内部架构与技术参数的情况下,通过构建角色扮演提示词迫使LLM生成错误(且具潜在危害性)的临床建议。本研究旨在揭示特定漏洞场景,为未来该领域的安全防护研究提供理论依据。
ABHINAYA -- A System for Speech Emotion Recognition In Naturalistic Conditions Challenge
Abstract
arXiv:2505.18217v1 Announce Type: cross Abstract: Speech emotion recognition (SER) in naturalistic settings remains a challenge due to the intrinsic variability, diverse recording conditions, and class imbalance. As participants in the Interspeech Naturalistic SER Challenge which focused on these complexities, we present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised and speech large language models (SLLM) for speech representations, leverages large language models (LLM) for textual context, and employs speech-text modeling with an SLLM to capture nuanced emotional cues. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting. Despite one model not being fully trained, the Abhinaya system ranked 4th among 166 submissions. Upon completion of training, it achieved state-of-the-art performance among published results, demonstrating the effectiveness of our approach for SER in real-world conditions.
摘要
自然场景下的语音情感识别(SER)由于内在变异性、多样化的录音条件以及类别不平衡等问题,仍面临挑战。作为聚焦这些复杂性的Interspeech自然场景SER挑战赛参赛者,我们提出Abhinaya系统,该系统整合了基于语音、文本及语音-文本的模型。我们的方法通过微调自监督语音大语言模型(SLLM)获取语音表征,利用大语言模型(LLM)提取文本上下文,并采用SLLM进行语音-文本建模以捕捉细微情感线索。为应对类别不平衡,我们应用定制化损失函数并通过多数投票生成分类决策。尽管其中一个模型未完全训练,Abhinaya系统仍在166份提交中排名第4。在完成训练后,该系统在已发表成果中达到了最先进的性能,证明了我们提出的方法在真实场景SER任务中的有效性。
Large Language Model-Driven Distributed Integrated Multimodal Sensing and Semantic Communications
Abstract
arXiv:2505.18194v1 Announce Type: cross Abstract: Traditional single-modal sensing systems, based solely on either radio frequency (RF) or visual data, struggle to cope with the demands of complex and dynamic environments. Furthermore, single-device systems are constrained by limited perspectives and insufficient spatial coverage, which impairs their effectiveness in urban or non-line-of-sight scenarios. To overcome these challenges, we propose a novel large language model (LLM)-driven distributed integrated multimodal sensing and semantic communication (LLM-DiSAC) framework. Specifically, our system consists of multiple collaborative sensing devices equipped with RF and camera modules, working together with an aggregation center to enhance sensing accuracy. First, on sensing devices, LLM-DiSAC develops an RF-vision fusion network (RVFN), which employs specialized feature extractors for RF and visual data, followed by a cross-attention module for effective multimodal integration. Second, an LLM-based semantic transmission network (LSTN) is proposed to enhance communication efficiency, where the LLM-based decoder leverages known channel parameters, such as transceiver distance and signal-to-noise ratio (SNR), to mitigate semantic distortion. Third, at the aggregation center, a transformer-based aggregation model (TRAM) with an adaptive aggregation attention mechanism is developed to fuse distributed features and enhance sensing accuracy. To preserve data privacy, a two-stage distributed learning strategy is introduced, allowing local model training at the device level and centralized aggregation model training using intermediate features. Finally, evaluations on a synthetic multi-view RF-visual dataset generated by the Genesis simulation engine show that LLM-DiSAC achieves good performance.
摘要
传统基于单一模态感知系统(仅依赖射频或视觉数据)难以应对复杂动态环境的需求。此外,单设备系统受限于视角狭窄和空间覆盖不足,在城区或非视距场景中效能受限。为突破这些限制,我们提出一种新型大语言模型驱动的分布式集成多模态感知与语义通信框架(LLM-DiSAC)。该系统由多个配备射频与摄像模块的协同感知设备构成,通过与汇聚中心协作提升感知精度。首先,在感知设备端,LLM-DiSAC开发了射频-视觉融合网络(RVFN),采用专用特征提取器处理射频与视觉数据,并通过交叉注意力模块实现高效多模态融合。其次,提出基于大语言模型的语义传输网络(LSTN)以提升通信效率,其中基于大语言模型的解码器利用收发距离、信噪比等已知信道参数来抑制语义失真。第三,在汇聚中心端开发了具有自适应聚合注意力机制的Transformer聚合模型(TRAM),用于融合分布式特征并提升感知精度。为保护数据隐私,采用两阶段分布式学习策略:在设备端进行本地模型训练,同时利用中间特征进行集中式聚合模型训练。最终,基于Genesis仿真引擎生成的合成多视角射频-视觉数据集验证表明,LLM-DiSAC实现了优越性能。
CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games
Abstract
arXiv:2505.18218v1 Announce Type: cross Abstract: Metaphors are a crucial way for humans to express complex or subtle ideas by comparing one concept to another, often from a different domain. However, many large language models (LLMs) struggle to interpret and apply metaphors in multi-agent language games, hindering their ability to engage in covert communication and semantic evasion, which are crucial for strategic communication. To address this challenge, we introduce CoMet, a framework that enables LLM-based agents to engage in metaphor processing. CoMet combines a hypothesis-based metaphor reasoner with a metaphor generator that improves through self-reflection and knowledge integration. This enhances the agents' ability to interpret and apply metaphors, improving the strategic and nuanced quality of their interactions. We evaluate CoMet on two multi-agent language games, Undercover and Adversarial Taboo, which emphasize covert communication and semantic evasion. Experimental results demonstrate that CoMet significantly enhances the agents' ability to communicate strategically using metaphors.
摘要
隐喻是人类通过将一个概念与另一领域的概念相比较来表达复杂或微妙思想的重要手段。然而,许多大语言模型(LLM)在多智能体语言游戏中难以理解和运用隐喻,这阻碍了其进行隐蔽交流和语义规避的能力,而这些能力对策略性沟通至关重要。为解决这一挑战,我们提出了CoMet框架,使基于LLM的智能体能够进行隐喻处理。CoMet将基于假设的隐喻推理器与通过自我反思和知识整合改进的隐喻生成器相结合,从而增强了智能体解释和应用隐喻的能力,提升了其交互的策略性和微妙性。我们在两款侧重隐蔽交流与语义规避的多智能体语言游戏——《Undercover》和《Adversarial Taboo》上评估了CoMet。实验结果表明,CoMet显著提升了智能体使用隐喻进行策略性沟通的能力。
Do BERT-Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLMs?
Abstract
arXiv:2505.18215v1 Announce Type: cross Abstract: The rapid adoption of LLMs has overshadowed the potential advantages of traditional BERT-like models in text classification. This study challenges the prevailing "LLM-centric" trend by systematically comparing three categories of methods, i.e., BERT-like model fine-tuning, LLM internal state utilization, and zero-shot inference, across six high-difficulty datasets. Our findings reveal that BERT-like models often outperform LLMs. We further categorize datasets into three types, perform PCA and probing experiments, and identify task-specific model strengths: BERT-like models excel in pattern-driven tasks, while LLMs dominate those requiring deep semantics or world knowledge. Based on this, we propose TaMAS, a fine-grained task selection strategy, advocating for a nuanced, task-driven approach over a one-size-fits-all reliance on LLMs.
摘要
大型语言模型(LLM)的快速普及掩盖了传统BERT类模型在文本分类中的潜在优势。本研究通过系统比较三类方法(即BERT类模型微调、LLM内部状态利用和零样本推理)在六个高难度数据集上的表现,对当前"以LLM为中心"的主流趋势提出挑战。实验结果表明,BERT类模型往往优于LLMs。我们进一步将数据集划分为三种类型,进行主成分分析和探测实验,发现任务特异性模型优势:BERT类模型擅长模式驱动型任务,而LLMs在需要深度语义或世界知识的任务中表现更优。基于此,我们提出细粒度任务选择策略TaMAS,倡导根据具体任务特性选择模型,而非盲目依赖LLMs的"一刀切"方案。
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis
Abstract
arXiv:2505.18223v1 Announce Type: cross Abstract: Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insights of the dataset. To address this, we introduce IDA-Bench, a novel benchmark evaluating LLM agents in multi-round interactive scenarios. Derived from complex Kaggle notebooks, tasks are presented as sequential natural language instructions by an LLM-simulated user. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Initial results show that even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on < 50% of the tasks, highlighting limitations not evident in single-turn tests. This work underscores the need to improve LLMs' multi-round capabilities for building more reliable data analysis agents, highlighting the necessity of achieving a balance between instruction following and reasoning.
摘要
大语言模型(LLMs)作为数据分析代理展现出潜力,但现有基准测试忽视了该领域的迭代特性——专家的决策会随着对数据集理解的深入而演变。为此,我们提出IDA-Bench,这是一个评估LLM代理在多轮交互场景中表现的新型基准。该基准源自复杂的Kaggle笔记本,任务以LLM模拟用户发出的序列化自然语言指令形式呈现。代理性能通过将其最终数值输出与人工基准进行对比来评判。初步结果显示,即使是Claude-3.7-thinking等最先进的编码代理,其任务成功率也低于50%,这揭示了单轮测试中无法体现的局限性。本研究强调需提升LLMs的多轮交互能力以构建更可靠的数据分析代理,同时指出必须在指令遵循与推理能力之间取得平衡。
Navigating Pitfalls: Evaluating LLMs in Machine Learning Programming Education
Abstract
arXiv:2505.18220v1 Announce Type: cross Abstract: The rapid advancement of Large Language Models (LLMs) has opened new avenues in education. This study examines the use of LLMs in supporting learning in machine learning education; in particular, it focuses on the ability of LLMs to identify common errors of practice (pitfalls) in machine learning code, and their ability to provide feedback that can guide learning. Using a portfolio of code samples, we consider four different LLMs: one closed model and three open models. Whilst the most basic pitfalls are readily identified by all models, many common pitfalls are not. They particularly struggle to identify pitfalls in the early stages of the ML pipeline, especially those which can lead to information leaks, a major source of failure within applied ML projects. They also exhibit limited success at identifying pitfalls around model selection, which is a concept that students often struggle with when first transitioning from theory to practice. This questions the use of current LLMs to support machine learning education, and also raises important questions about their use by novice practitioners. Nevertheless, when LLMs successfully identify pitfalls in code, they do provide feedback that includes advice on how to proceed, emphasising their potential role in guiding learners. We also compare the capability of closed and open LLM models, and find that the gap is relatively small given the large difference in model sizes. This presents an opportunity to deploy, and potentially customise, smaller more efficient LLM models within education, avoiding risks around cost and data sharing associated with commercial models.
摘要
大型语言模型(LLMs)的快速发展为教育领域开辟了新途径。本研究探讨了LLMs在机器学习教育中支持学习的应用,重点关注其识别机器学习代码中常见实践错误(陷阱)的能力,以及提供学习指导反馈的能力。通过一组代码样本组合,我们评估了四种不同LLMs:一种闭源模型和三种开源模型。虽然所有模型都能轻松识别最基本的陷阱,但对许多常见陷阱却无法识别。这些模型尤其难以识别机器学习流程早期阶段的陷阱,特别是可能导致信息泄露的陷阱——这是应用机器学习项目失败的主要根源。此外,在模型选择相关的陷阱识别上,这些模型表现有限,而该概念正是学生从理论转向实践时经常遇到的难点。这对当前LLMs支持机器学习教育的适用性提出了质疑,同时也引发了关于新手从业者使用这些模型的重要问题。然而,当LLMs成功识别代码中的陷阱时,其反馈确实包含后续操作建议,凸显了其在学习引导方面的潜在作用。我们还比较了闭源与开源LLM模型的能力,发现尽管模型规模差异显著,但性能差距相对较小。这为在教育领域部署(并可能定制)更高效的小型LLM模型提供了机遇,同时规避了商业模型在成本和数据共享方面的风险。
Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality
Abstract
arXiv:2505.18227v1 Announce Type: cross Abstract: In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.
摘要
在Transformer架构中,通过将输入分割为固定长度块来形成标记——这些从原始数据中提取的离散单元。每个标记随后被映射为嵌入表示,从而在保留输入核心信息的同时实现并行注意力计算。由于Transformer自注意力机制具有二次计算复杂度,标记缩减技术主要被用作效率优化策略,这在单模态视觉和语言领域尤为明显,因其有助于平衡计算成本、内存占用和推理延迟。尽管已有这些进展,本文主张在大模型时代,标记缩减应当超越其传统效率导向的角色。我们将其重新定位为生成建模的基础原则,认为其对模型架构和更广泛的应用具有关键影响。具体而言,我们论证了在视觉、语言和多模态系统中,标记缩减能够:(i)促进更深层次的多模态融合与对齐;(ii)缓解'过度思考'和幻觉现象;(iii)保持长输入序列的连贯性;(iv)提升训练稳定性等。我们将标记重新定义为超越效率优化的核心要素,并据此勾勒出未来研究方向,包括算法设计、强化学习引导的标记缩减、上下文学习中的标记优化,以及更广泛的机器学习和科学领域应用。我们强调该技术有望推动新型模型架构和学习策略的发展,从而提升模型鲁棒性、增强可解释性,并更好地契合生成建模的目标。
NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache
Abstract
arXiv:2505.18231v1 Announce Type: cross Abstract: Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, due to the large size of the key-value (KV) cache. Vector Quantization (VQ) has recently been adopted to alleviate this issue, but we find that the existing approach is susceptible to distribution shift due to its reliance on calibration datasets. To address this limitation, we introduce NSNQuant, a calibration-free Vector Quantization (VQ) technique designed for low-bit compression of the KV cache. By applying a three-step transformation, namely (1) a token-wise normalization (Normalize), (2) a channel-wise centering (Shift), and (3) a second token-wise normalization (Normalize), together with a Hadamard transform, NSNQuant effectively aligns the token distribution with the standard normal distribution. This alignment enables robust, calibration-free vector quantization using a single reusable codebook. Extensive experiments show that NSNQuant consistently outperforms prior methods in both 1-bit and 2-bit settings, offering strong generalization and up to 3× throughput gain over full-precision baselines.
摘要
大语言模型(LLM)推理过程通常具有较高的内存需求,尤其是在处理大批量数据和长序列时,关键值(KV)缓存的大容量是主要原因。向量量化(VQ)技术近期被用于缓解这一问题,但我们发现现有方法因依赖校准数据集而易受分布偏移影响。为克服这一局限,本文提出NSNQuant——一种无需校准的向量量化技术,专为KV缓存的低位压缩设计。该方法通过三步变换(1)词元级归一化(Normalize)、(2)通道级中心化(Shift)、(3)二次词元级归一化(Normalize)结合Hadamard变换,将词元分布有效对齐标准正态分布。这种对齐方式实现了基于单一可复用码本的稳健、免校准向量量化。大量实验表明,NSNQuant在1比特和2比特设置下均优于现有方法,展现出强泛化能力,相比全精度基线最高可获得3倍吞吐量提升。
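The three-step Normalize, Shift, Normalize transformation can be sketched in a few lines. This is only an illustration under stated assumptions: "token-wise normalization" is taken to mean scaling each token vector to unit L2 norm, and "channel-wise centering" to mean subtracting the per-channel mean across tokens. The paper's exact definitions, the Hadamard transform, and the codebook lookup are not reproduced, and the function names are hypothetical.

```python
import math


def normalize_tokens(x):
    """Scale each token (row) to unit L2 norm; zero rows are left unchanged."""
    out = []
    for row in x:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def shift_channels(x):
    """Subtract the mean of each channel (column) computed across all tokens."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def nsn(x):
    """Normalize -> Shift -> Normalize, applied to a tokens x channels matrix."""
    return normalize_tokens(shift_channels(normalize_tokens(x)))


# Toy KV slice: 3 tokens, 2 channels, with very different token magnitudes.
kv = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
for row in nsn(kv):
    print([round(v, 4) for v in row])
```

After the transform every token has the same norm regardless of its original scale, which is the kind of input-independent statistic a fixed, reusable codebook relies on.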
Taming LLMs with Negative Samples: A Reference-Free Framework to Evaluate Presentation Content with Actionable Feedback
Abstract
arXiv:2505.18240v1 Announce Type: cross Abstract: The generation of presentation slides automatically is an important problem in the era of generative AI. This paper focuses on evaluating multimodal content in presentation slides that can effectively summarize a document and convey concepts to a broad audience. We introduce a benchmark dataset, RefSlides, consisting of human-made high-quality presentations that span various topics. Next, we propose a set of metrics to characterize different intrinsic properties of the content of a presentation and present REFLEX, an evaluation approach that generates scores and actionable feedback for these metrics. We achieve this by generating negative presentation samples with different degrees of metric-specific perturbations and use them to fine-tune LLMs. This reference-free evaluation technique does not require ground truth presentations during inference. Our extensive automated and human experiments demonstrate that our evaluation approach outperforms classical heuristic-based and state-of-the-art large language model-based evaluations in generating scores and explanations.
摘要
在生成式人工智能时代,自动生成演示文稿幻灯片是一个重要课题。本文重点评估能够有效总结文档内容并向广泛受众传递概念的多模态演示文稿内容。我们引入了一个基准数据集RefSlides,该数据集包含涵盖多个主题的人工制作高质量演示文稿。接着,我们提出一组用于表征演示文稿内容不同内在特性的指标,并提出了REFLEX评估方法——该方法能针对这些指标生成评分和可操作的反馈。我们通过生成具有不同程度指标特异性扰动的负面演示样本,并利用这些样本来微调大语言模型,从而实现这一目标。这种无参考评估技术在推理过程中不需要真实演示文稿作为基准。大量自动化及人工实验表明,我们的评估方法在生成评分和解释方面优于传统的基于启发式方法和最先进的大语言模型评估方法。
The Origins of Representation Manifolds in Large Language Models
Abstract
arXiv:2505.18235v1 Announce Type: cross Abstract: There is a large ongoing scientific effort in mechanistic interpretability to map embeddings and internal representations of AI systems into human-understandable concepts. A key element of this effort is the linear representation hypothesis, which posits that neural representations are sparse linear combinations of "almost-orthogonal" direction vectors, reflecting the presence or absence of different features. This model underpins the use of sparse autoencoders to recover features from representations. Moving towards a fuller model of features, in which neural representations could encode not just the presence but also a potentially continuous and multidimensional value for a feature, has been a subject of intense recent discourse. We describe why and how a feature might be represented as a manifold, demonstrating in particular that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths, potentially answering the question of how distance in representation space and relatedness in concept space could be connected. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.
摘要
机制可解释性研究领域正致力于将人工智能系统的嵌入和内部表征映射为人类可理解的概念,其中线性表征假说是该研究的核心要素。该假说认为神经表征是由"近似正交"的方向向量构成的稀疏线性组合,反映不同特征的存在与否。这一模型支撑了使用稀疏自编码器从表征中恢复特征的方法。近期学界热议的焦点是构建更完整的特征模型,使神经表征不仅能编码特征的存在性,还能表达特征的潜在连续多维取值。本文阐述了特征为何及如何被表征为流形,特别论证了表征空间中的余弦相似性可能通过流形上的最短路径编码特征的内在几何结构,这或许能解释表征空间距离与概念空间关联性之间的联系。该理论的关键假设和预测在大型语言模型的文本嵌入与标记激活上得到了验证。
ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning
Abstract
arXiv:2505.18232v1 Announce Type: cross Abstract: The deployment of Large language models (LLMs) in many fields is largely hindered by their high computational and memory costs. Recent studies suggest that LLMs exhibit sparsity, which can be used for pruning. Previous pruning methods typically follow a prune-then-finetune paradigm. Since the pruned parts still contain valuable information, statically removing them without updating the remaining parameters often results in irreversible performance degradation, requiring costly recovery fine-tuning (RFT) to maintain performance. To address this, we propose a novel paradigm: first apply regularization, then prune. Based on this paradigm, we propose ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning. We multiply the output of each transformer layer by an initial weight, then we iteratively learn the weights of each transformer layer by using a small amount of data in a simple way. After that, we apply regularization to the difference between the output and input of the layers with smaller weights, forcing the information to be transferred to the remaining layers. Compared with direct pruning, ELDeR reduces the information loss caused by direct parameter removal, thus better preserving the model's language modeling ability. Experimental results show that ELDeR achieves superior performance compared with powerful layer-wise structured pruning methods, while greatly reducing RFT computational costs. Since ELDeR is a layer-wise pruning method, its end-to-end acceleration effect is obvious, making it a promising technique for efficient LLMs.
摘要
大型语言模型(LLMs)在许多领域的应用因其高昂的计算和内存成本而受到严重制约。近期研究表明,LLMs具有稀疏性特征,这一特性可用于模型剪枝。传统剪枝方法通常遵循"先剪枝后微调"的范式。由于被剪枝部分仍包含有价值信息,静态移除这些参数而不更新剩余参数往往会导致不可逆的性能下降,需要昂贵的恢复性微调(RFT)来维持性能。为解决这一问题,我们提出了一种新范式:先进行正则化处理,再实施剪枝。基于此范式,我们提出了ELDeR:通过数据驱动的正则化分层剪枝实现高效LLMs。该方法首先为每个Transformer层的输出乘以初始权重,随后通过少量数据以简单方式迭代学习各层的权重参数。之后对权重较小层的输入输出差异施加正则化约束,迫使信息转移至保留层。与直接剪枝相比,ELDeR显著降低了参数直接移除造成的信息损失,从而更好地保持了模型的语言建模能力。实验结果表明,相较于强大的分层结构化剪枝方法,ELDeR在取得更优性能的同时大幅降低了RFT计算成本。由于ELDeR采用分层剪枝策略,其端到端加速效果显著,为构建高效LLMs提供了极具前景的技术方案。
Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens
Abstract
arXiv:2505.18237v1 Announce Type: cross Abstract: The recent rise of Large Reasoning Models (LRMs) has significantly improved multi-step reasoning performance, but often at the cost of generating excessively long reasoning chains. This paper revisits the efficiency of such reasoning processes through an information-theoretic lens, revealing a fundamental trade-off between reasoning length and semantic efficiency. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution, respectively. Empirical analyses show that longer reasoning chains tend to exhibit higher information bias and diminishing information gain, especially for incorrect answers. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high, improving efficiency while maintaining competitive accuracy. Compared to the Vanilla Think approach (default mode), our strategy yields a 1.10% improvement in average accuracy and a 50.80% reduction in token usage on QwQ-32B across six benchmark tasks spanning diverse reasoning types and difficulty levels, demonstrating superior efficiency and reasoning performance. These results underscore the promise of entropy-based methods for enhancing both accuracy and cost-efficiency in large language model deployment.
摘要
近年来,大型推理模型(LRMs)的兴起显著提升了多步推理性能,但往往伴随生成冗余推理链的问题。本文通过信息论视角重新审视推理过程的效率,揭示了推理长度与语义效率之间的本质权衡。我们提出InfoBias(信息偏差)和InfoGain(信息增益)两个量化指标,分别用于衡量推理路径与理想状态的偏离程度及步骤间信息贡献。实证分析表明,过长的推理链通常伴随更高信息偏差和递减的信息增益,错误答案中这种现象尤为显著。基于此发现,我们提出基于信息熵的自适应推理策略(Adaptive Think),在置信度达标时动态终止推理过程。相比基线方法(Vanilla Think),该策略在涵盖六类推理任务和难度等级的QwQ-32B基准测试中实现平均准确率提升1.10%,同时减少50.80%的token消耗,展现出卓越的效率和推理性能。这些结果印证了基于信息熵的方法在提升大语言模型部署的准确性与成本效益方面的潜力。
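The halting idea in the abstract above can be illustrated with a minimal sketch. It assumes, hypothetically, that after each reasoning step we can read off a probability distribution over candidate answers; the abstract does not specify how confidence is measured, so the entropy threshold, the function names, and the toy distributions below are illustrative assumptions rather than the authors' implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_think(step_distributions, threshold=0.3):
    """Return the index of the first reasoning step whose answer
    distribution has entropy below `threshold` (an assumed value).
    If no step qualifies, return the last index, i.e. use the full chain."""
    for i, probs in enumerate(step_distributions):
        if entropy(probs) < threshold:
            return i
    return len(step_distributions) - 1


# Hypothetical answer distributions over four candidates, one per step.
steps = [
    [0.25, 0.25, 0.25, 0.25],
    [0.60, 0.20, 0.10, 0.10],
    [0.85, 0.05, 0.05, 0.05],
    [0.97, 0.01, 0.01, 0.01],
    [0.985, 0.005, 0.005, 0.005],
]
stop_at = adaptive_think(steps)
print(f"halt after step {stop_at}; skipped {len(steps) - 1 - stop_at} step(s)")
```

With these toy numbers, reasoning halts before the final step because the answer distribution is already sharply peaked; a lower threshold trades more tokens for more certainty.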
Multi-Scale Probabilistic Generation Theory: A Hierarchical Framework for Interpreting Large Language Models
Abstract
arXiv:2505.18244v1 Announce Type: cross Abstract: Large Transformer-based language models achieve remarkable performance but remain opaque in how they plan, structure, and realize text. We introduce Multi-Scale Probabilistic Generation Theory (MSPGT), a hierarchical framework that factorizes generation into three semantic scales (global context, intermediate structure, and local word choices) and aligns each scale with specific layer ranges in Transformer architectures. To identify scale boundaries, we propose two complementary metrics: attention span thresholds and inter-layer mutual information peaks. Across four representative models (GPT-2, BERT, RoBERTa, and T5), these metrics yield stable local/intermediate/global partitions, corroborated by probing tasks and causal interventions. We find that decoder-only models allocate more layers to intermediate and global processing while encoder-only models emphasize local feature extraction. Through targeted interventions, we demonstrate that local-scale manipulations primarily influence lexical diversity, intermediate-scale modifications affect sentence structure and length, and global-scale perturbations impact discourse coherence, all with statistically significant effects. MSPGT thus offers a unified, architecture-agnostic method for interpreting, diagnosing, and controlling large language models, bridging the gap between mechanistic interpretability and emergent capabilities.
摘要
基于Transformer的大型语言模型表现出卓越性能,但其文本生成过程中的规划、结构与实现机制仍不透明。本研究提出多尺度概率生成理论(MSPGT),该分层框架将生成过程分解为三个语义尺度:全局语境、中间结构和局部词汇选择,并将每个尺度与Transformer架构的特定层级范围对应。为确定尺度边界,我们提出两个互补指标:注意力跨度阈值和层间互信息峰值。在四种代表性模型(GPT-2、BERT、RoBERTa和T5)上的实验表明,这些指标能稳定划分局部/中间/全局层级分区,该结果通过探测任务和因果干预得到验证。研究发现:纯解码器模型将更多层级分配给中间和全局处理,而纯编码器模型更侧重局部特征提取。通过定向干预实验证实:局部尺度调控主要影响词汇多样性,中间尺度修改改变句子结构和长度,全局尺度扰动则影响语篇连贯性——所有效应均具有统计显著性。MSPGT理论由此提供了一种架构无关的统一方法,可用于大型语言模型的解释、诊断与控制,在机制可解释性与涌现能力之间架设了桥梁。
MetaGen Blended RAG: Higher Accuracy for Domain-Specific Q&A Without Fine-Tuning
Abstract
arXiv:2505.18247v1 Announce Type: cross Abstract: Despite the widespread exploration of Retrieval-Augmented Generation (RAG), its deployment in enterprises for domain-specific datasets remains limited due to poor answer accuracy. These corpora, often shielded behind firewalls in private enterprise knowledge bases, have complex, domain-specific terminology rarely seen by LLMs during pre-training and exhibit significant semantic variability across domains (such as networking, military, or legal), or even within a single domain like medicine, and thus result in poor context precision for RAG systems. Currently, in such situations, fine-tuning or RAG with fine-tuning is attempted, but these approaches are slow and expensive, and their accuracy does not generalize as new domain-specific data emerges. We propose an approach for Enterprise Search that focuses on enhancing the retriever for a domain-specific corpus through hybrid query indexes and metadata enrichment. This 'MetaGen Blended RAG' method constructs a metadata generation pipeline using key concepts, topics, and acronyms, and then creates a metadata-enriched hybrid index with boosted search queries. This approach avoids overfitting and generalizes effectively across domains. On the PubMedQA benchmark for the biomedical domain, the proposed method achieves 82% retrieval accuracy and 77% RAG accuracy, surpassing all previous RAG accuracy results without fine-tuning and setting a new benchmark for zero-shot results while outperforming much larger models like GPT3.5. The results are even comparable to the best fine-tuned models on this dataset, and we further demonstrate the robustness and scalability of the approach by evaluating it on other Q&A datasets such as SQuAD and NQ.
摘要
尽管检索增强生成(RAG)技术已被广泛探索,但由于答案准确性不足,其在企业领域特定数据集中的部署仍受限。这些通常位于企业私有知识库防火墙后的语料库具有复杂且领域专用的术语(如网络、军事或法律等领域),这些术语在LLMs预训练阶段极少出现;同时不同领域(甚至医学等单一领域内部)存在显著的语义差异性,导致RAG系统的上下文精确度低下。当前此类场景通常尝试微调或"微调+RAG"方案,但这些方法存在速度慢、成本高且随新增领域数据出现时泛化能力不足的缺陷。我们提出一种企业搜索解决方案,通过混合查询索引与元数据增强来优化领域专用语料库的检索器。该"MetaGen混合RAG"方法构建了基于关键概念、主题及缩略词的元数据生成管道,继而创建具有增强搜索查询的元数据混合索引。该方法避免了过拟合问题并能有效实现跨领域泛化。在生物医学领域的PubMedQA基准测试中,所提方法取得82%的检索准确率和77%的RAG准确率,超越所有无需微调的既往RAG精度结果,为零样本效果树立了新基准,同时优于GPT3.5等更大规模模型。其效果甚至可媲美该数据集上最佳微调模型,我们进一步通过SQuAD、NQ等问答数据集验证了该方法的鲁棒性与可扩展性。
Is It Bad to Work All the Time? Cross-Cultural Evaluation of Social Norm Biases in GPT-4
Abstract
arXiv:2505.18322v1 Announce Type: cross Abstract: LLMs have been demonstrated to align with the values of Western or North American cultures. Prior work predominantly showed this effect through leveraging surveys that directly ask (originally people and now also LLMs) about their values. However, it is hard to believe that LLMs would consistently apply those values in real-world scenarios. To address that, we take a bottom-up approach, asking LLMs to reason about cultural norms in narratives from different cultures. We find that GPT-4 tends to generate norms that, while not necessarily incorrect, are significantly less culture-specific. In addition, while it avoids overtly generating stereotypes, the stereotypical representations of certain cultures are merely hidden rather than suppressed in the model, and such stereotypes can be easily recovered. Addressing these challenges is a crucial step towards developing LLMs that fairly serve their diverse user base.
摘要
已有研究表明,大型语言模型(LLMs)与西方或北美文化价值观保持一致。先前工作主要通过直接询问(最初是人类,现在也包括LLMs)其价值观的调查来证明这一效应。然而,很难相信LLMs会在现实场景中始终如一地应用这些价值观。为此,我们采用自下而上的方法,要求LLMs对不同文化叙事中的文化规范进行推理。我们发现,GPT-4倾向于生成的规范虽然不一定错误,但显著缺乏文化特异性。此外,尽管它避免公然生成刻板印象,但某些文化的刻板表征在模型中只是被隐藏而非消除,这类刻板印象很容易被恢复。解决这些挑战是开发能够公平服务多元化用户群体的LLMs的关键一步。
TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification
Abstract
arXiv:2505.18283v1 Announce Type: cross Abstract: Recent advances such as Chain-of-Thought prompting have significantly improved large language models (LLMs) in zero-shot medical reasoning. However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs suffer from poor generalization under distribution shifts and limited adaptability to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives without any model fine-tuning or parameter updates. To support this generalist-specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs, without any parameter updates. The code will be available at https://github.com/JianghaoWu/TAGS.
摘要
近期诸如思维链提示等进展显著提升了大型语言模型(LLMs)在零样本医疗推理任务中的表现。然而,基于提示的方法往往存在浅层推理和不稳定的问题,而经过微调的医疗LLMs则在分布偏移下泛化能力不足,且对未见临床场景的适应性有限。为应对这些局限性,我们提出TAGS框架——一种测试时方法,通过将通用基础模型与领域专家模型相结合,在不进行任何模型微调或参数更新的情况下提供互补视角。为支持这种通用-专家协同推理机制,我们引入两个辅助模块:分层检索机制(通过语义和原理级相似性筛选示例,提供多尺度参考样本)和可靠性评分器(评估推理一致性以指导最终答案聚合)。TAGS在九项MedQA基准测试中表现优异,将GPT-4o准确率提升13.8%,DeepSeek-R1提升16.8%,并将基础7B模型性能从14.1%提升至23.9%。这些结果超越了多个经过微调的医疗LLMs,且无需任何参数更新。代码将在https://github.com/JianghaoWu/TAGS发布。
CrashAgent: Crash Scenario Generation via Multi-modal Reasoning
Abstract
arXiv:2505.18341v1 Announce Type: cross Abstract: Training and evaluating autonomous driving algorithms requires a diverse range of scenarios. However, most available datasets predominantly consist of normal driving behaviors demonstrated by human drivers, resulting in a limited number of safety-critical cases. This imbalance, often referred to as a long-tail distribution, restricts the ability of driving algorithms to learn from crucial scenarios involving risk or failure, scenarios that are essential for humans to develop driving skills efficiently. To generate such scenarios, we utilize Multi-modal Large Language Models to convert crash reports of accidents into a structured scenario format, which can be directly executed within simulations. Specifically, we introduce CrashAgent, a multi-agent framework designed to interpret multi-modal real-world traffic crash reports for the generation of both road layouts and the behaviors of the ego vehicle and surrounding traffic participants. We comprehensively evaluate the generated crash scenarios from multiple perspectives, including the accuracy of layout reconstruction, collision rate, and diversity. The resulting high-quality and large-scale crash dataset will be publicly available to support the development of safe driving algorithms in handling safety-critical situations.
摘要
训练和评估自动驾驶算法需要多样化的场景。然而,现有数据集主要由人类驾驶员展示的正常驾驶行为构成,导致安全关键案例数量有限。这种通常被称为长尾分布的数据失衡问题,限制了驾驶算法从涉及风险或故障的关键场景中学习的能力,而这些场景对人类高效掌握驾驶技能至关重要。为生成此类场景,我们利用多模态大语言模型将交通事故报告转化为结构化场景格式,使其可直接在仿真环境中执行。具体而言,我们提出CrashAgent——一个多智能体框架,旨在解析多模态真实世界交通事故报告,以生成道路布局、自车及周围交通参与者的行为。我们从布局重建准确性、碰撞率和多样性等多维度对生成的碰撞场景进行全面评估。最终形成的高质量大规模碰撞数据集将公开提供,以支持安全驾驶算法处理关键安全场景的研发工作。
PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language
Abstract
arXiv:2505.18331v1 Announce Type: cross Abstract: Medical consumer question answering (CQA) is crucial for empowering patients by providing personalized and reliable health information. Despite recent advances in large language models (LLMs) for medical QA, consumer-oriented and multilingual resources, particularly in low-resource languages like Persian, remain sparse. To bridge this gap, we present PerMedCQA, the first Persian-language benchmark for evaluating LLMs on real-world, consumer-generated medical questions. Curated from a large medical QA forum, PerMedCQA contains 68,138 question-answer pairs, refined through careful data cleaning from an initial set of 87,780 raw entries. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs, utilizing MedJudge, a novel rubric-based evaluation framework driven by an LLM grader, validated against expert human annotators. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems. The data is publicly available on https://huggingface.co/datasets/NaghmehAI/PerMedCQA
摘要
医疗消费者问答(CQA)通过提供个性化且可靠的健康信息,对增强患者自主权至关重要。尽管目前基于大语言模型(LLM)的医疗问答系统取得进展,但面向消费者且支持多语言的资源——尤其是波斯语等低资源语言——仍然匮乏。为填补这一空白,我们推出首个波斯语基准测试集PerMedCQA,用于评估LLM处理真实世界消费者医疗问题的能力。该数据集从大型医疗问答论坛中精选而成,包含68,138个问答对,是从87,780条原始条目经过严格数据清洗后获得的。我们采用基于量规的新型评估框架MedJudge(由LLM评分器驱动并经专家人工标注验证),对多个最先进的多语言及指令微调LLM进行了评估。研究结果揭示了多语言医疗问答中的关键挑战,并为开发更精准、更具情境感知的医疗辅助系统提供了重要见解。数据已公开于https://huggingface.co/datasets/NaghmehAI/PerMedCQA。
Task Specific Pruning with LLM-Sieve: How Many Parameters Does Your Task Really Need?
Abstract
arXiv:2505.18350v1 Announce Type: cross Abstract: As Large Language Models (LLMs) are increasingly being adopted for narrow tasks - such as medical question answering or sentiment analysis - and deployed in resource-constrained settings, a key question arises: how many parameters does a task actually need? In this work, we present LLM-Sieve, the first comprehensive framework for task-specific pruning of LLMs that achieves 20-75% parameter reduction with only 1-5% accuracy degradation across diverse domains. Unlike prior methods that apply uniform pruning or rely on low-rank approximations of weight matrices or inputs in isolation, LLM-Sieve (i) learns task-aware joint projections to better approximate output behavior, and (ii) employs a Genetic Algorithm to discover differentiated pruning levels for each matrix. LLM-Sieve is fully compatible with LoRA fine-tuning and quantization, and uniquely demonstrates strong generalization across datasets within the same task domain. Together, these results establish a practical and robust mechanism to generate smaller performant task-specific models.
摘要
随着大语言模型(LLM)日益应用于特定任务(如医疗问答或情感分析)并部署于资源受限环境,一个关键问题随之产生:特定任务实际需要多少参数量?本研究提出LLM-Sieve框架,这是首个面向任务定制的LLM剪枝综合方案,能在多样化领域实现20-75%的参数削减,同时仅产生1-5%的精度损失。与传统采用均匀剪枝或单独依赖权重矩阵/输入低秩近似的方法不同,LLM-Sieve具有两大创新:(i) 通过任务感知的联合投影学习更精准逼近输出行为;(ii) 采用遗传算法为每个矩阵发现差异化剪枝强度。该框架完全兼容LoRA微调与量化技术,并独特展现出同任务领域内跨数据集的强泛化能力。这些成果共同构建了一个实用且鲁棒的机制,可生成更小规模的高性能任务专用模型。
A Critical Evaluation of Defenses against Prompt Injection Attacks
Abstract
arXiv:2505.18333v1 Announce Type: cross Abstract: Large Language Models (LLMs) are vulnerable to prompt injection attacks, and several defenses have recently been proposed, often claiming to mitigate these attacks successfully. However, we argue that existing studies lack a principled approach to evaluating these defenses. In this paper, we argue the need to assess defenses across two critical dimensions: (1) effectiveness, measured against both existing and adaptive prompt injection attacks involving diverse target and injected prompts, and (2) general-purpose utility, ensuring that the defense does not compromise the foundational capabilities of the LLM. Our critical evaluation reveals that prior studies have not followed such a comprehensive evaluation methodology. When assessed using this principled approach, we show that existing defenses are not as successful as previously reported. This work provides a foundation for evaluating future defenses and guiding their development. Our code and data are available at: https://github.com/PIEval123/PIEval.
摘要
大型语言模型(LLMs)易受提示注入攻击,近期已有若干防御方案被提出,且常宣称能有效缓解此类攻击。然而,我们认为现有研究缺乏评估这些防御措施的体系化方法。本文提出应从两个关键维度进行评估:(1)防御有效性,需针对现有及自适应的提示注入攻击进行测试,涵盖多样化目标提示与注入提示;(2)通用功能性,需确保防御机制不影响LLM的基础能力。批判性评估表明,先前研究均未遵循如此全面的评估方法。当采用本研究的体系化方法进行评估时,我们发现现有防御方案的实际效果远低于既有报道。本工作为未来防御方案的评估与开发提供了方法论基础。代码与数据详见:https://github.com/PIEval123/PIEval。
SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph Algorithms for Text-to-SQL on Large-Scale Databases
Abstract
arXiv:2505.18363v1 Announce Type: cross Abstract: Text-to-SQL systems translate natural language questions into executable SQL queries, and recent progress with large language models (LLMs) has driven substantial improvements in this task. Schema linking remains a critical component in Text-to-SQL systems, reducing prompt size for models with narrow context windows and sharpening model focus even when the entire schema fits. We present a zero-shot, training-free schema linking approach that first constructs a schema graph based on foreign key relations, then uses a single prompt to Gemini 2.5 Flash to extract source and destination tables from the user query, followed by applying classical path-finding algorithms and post-processing to identify the optimal sequence of tables and columns that should be joined, enabling the LLM to generate more accurate SQL queries. Despite being simple, cost-effective, and highly scalable, our method achieves state-of-the-art results on the BIRD benchmark, outperforming previous specialized, fine-tuned, and complex multi-step LLM-based approaches. We conduct detailed ablation studies to examine the precision-recall trade-off in our framework. Additionally, we evaluate the execution accuracy of our schema filtering method compared to other approaches across various model sizes.
摘要
文本到SQL系统将自然语言问题转化为可执行的SQL查询,而大型语言模型(LLM)的最新进展显著提升了该任务的性能。模式链接仍是文本到SQL系统的关键组件,它既能缩减上下文窗口有限模型的提示规模,也能在完整模式适配时增强模型专注力。我们提出一种零样本、无需训练的模式链接方法:首先基于外键关系构建模式图,随后使用单一提示通过Gemini 2.5 Flash从用户查询中提取源表和目标表,再应用经典路径查找算法及后处理技术确定最优的表列连接序列,从而使LLM能生成更精确的SQL查询。尽管该方法简单、成本效益高且具备高度可扩展性,但在BIRD基准测试中仍取得了最先进的成果,超越了先前基于LLM的专用、微调及复杂多步骤方法。我们通过详细消融实验研究了框架中的精确率-召回率权衡,并对比不同模型规模下模式过滤方法与其他方案在执行准确率上的表现。
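The graph step described in the abstract above can be sketched as follows. The abstract says only that "classical path-finding algorithms" are applied to a schema graph built from foreign key relations; breadth-first search is used here as one such algorithm, and the table names, function names, and the assumption that the source and destination tables have already been extracted by the LLM are all hypothetical.

```python
from collections import deque


def build_schema_graph(foreign_keys):
    """Build an undirected adjacency map from (table, referenced_table) pairs."""
    graph = {}
    for table, referenced in foreign_keys:
        graph.setdefault(table, set()).add(referenced)
        graph.setdefault(referenced, set()).add(table)
    return graph


def shortest_join_path(graph, source, destination):
    """Breadth-first search for the shortest chain of tables linking
    `source` to `destination`; returns None if no chain exists."""
    if source not in graph or destination not in graph:
        return None
    parents = {source: None}
    queue = deque([source])
    while queue:
        table = queue.popleft()
        if table == destination:
            path = []
            while table is not None:
                path.append(table)
                table = parents[table]
            return path[::-1]
        for neighbor in sorted(graph[table]):  # sorted for deterministic output
            if neighbor not in parents:
                parents[neighbor] = table
                queue.append(neighbor)
    return None


# Hypothetical schema: each pair is (table, table it references).
foreign_keys = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("products", "suppliers"),
]
graph = build_schema_graph(foreign_keys)
path = shortest_join_path(graph, "customers", "suppliers")
print(" JOIN ".join(path))
```

Only the tables on the returned path would then be passed to the SQL-generating model, which is how the prompt shrinks without dropping the tables needed for the join.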
The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs
Abstract
arXiv:2505.18356v1 Announce Type: cross Abstract: Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overlapping. To exploit this implicit separability between task and target language parameterization, we develop and analyze numerous modular frameworks to improve the composition of the two during fine-tuning. These methods generally employ freezing parameters or post hoc model merging to assign math and language improvement to different key parts of the LLM. In the absence of in-language math data, we demonstrate that the modular approaches successfully improve upon baselines across three languages, four models, and two fine-tuning paradigms (full and LoRA). Furthermore, we identify the most consistently successful modular method to be fine-tuning separate language and math experts and model merging via Layer-Swapping, somewhat surprisingly. We offer possible explanations for this result via recent works on the linearity of task vectors. We further explain this by empirically showing that reverting less useful fine-tuning updates after training often outperforms freezing them from the start.
摘要
大语言模型(LLMs)在非高资源语言任务中仍面临困难。本研究探讨了在任务特定训练数据稀缺的低资源语言中的跨语言迁移。基于先前工作,我们首先验证了模型参数中对数学推理和多语言能力最关键的子集明显不重叠。为利用任务与目标语言参数化之间的这种隐式可分离性,我们开发并分析了多种模块化框架,以改进两者在微调期间的组合。这些方法通常采用参数冻结或事后模型融合技术,将数学与语言能力的提升分别分配给大语言模型的不同关键部分。在缺乏目标语言数学数据的情况下,我们证明这些模块化方法在三种语言、四种模型和两种微调范式(全参数与LoRA)中均成功超越了基线水平。此外,我们发现最稳定有效的模块化方法是通过层交换技术微调独立的语言与数学专家模型并进行融合,这一结果出人意料。我们结合近期关于任务向量线性特征的研究给出了可能的解释,并通过实证表明:训练后回退低效的微调更新,往往优于从一开始就冻结这些参数。
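A toy sketch of the merging step named in the abstract above. The abstract does not say which layers are exchanged, so the choice below (keep the math expert's middle layers and take the outermost layers from the language expert) is an assumption, as are the `layers.<i>.` parameter naming scheme and the function name. Strings stand in for weight tensors.

```python
def layer_swap(math_expert, language_expert, num_layers, swap_bottom, swap_top):
    """Start from the math expert's parameters and replace the lowest
    `swap_bottom` and highest `swap_top` transformer layers with the
    language expert's. Parameters outside `layers.<i>.` (for example
    embeddings) are kept from the math expert in this sketch."""
    swapped = set(range(swap_bottom)) | set(range(num_layers - swap_top, num_layers))
    merged = {}
    for name, value in math_expert.items():
        layer = int(name.split(".")[1]) if name.startswith("layers.") else None
        merged[name] = language_expert[name] if layer in swapped else value
    return merged


# Hypothetical four-layer experts fine-tuned from the same base model.
names = ["embed.weight"] + [f"layers.{i}.weight" for i in range(4)]
math_expert = {name: "math" for name in names}
language_expert = {name: "lang" for name in names}

merged = layer_swap(math_expert, language_expert, num_layers=4, swap_bottom=1, swap_top=1)
for name in names:
    print(name, "<-", merged[name])
```

Because both experts share the same architecture and base checkpoint, the merge is a pure parameter copy and needs no further training in this sketch.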
Next-token pretraining implies in-context learning
Abstract
arXiv:2505.18373v1 Announce Type: cross Abstract: We argue that in-context learning (ICL) predictably arises from standard self-supervised next-token pretraining, rather than being an exotic emergent property. This work establishes the foundational principles of this emergence by focusing on in-distribution ICL, demonstrating how models necessarily adapt to context when trained on token sequences, especially from non-ergodic sources. Our information-theoretic framework precisely predicts these in-distribution ICL dynamics (i.e., context-dependent loss reduction). We verify this with experiments using synthetic datasets of differing types of correlational structure, reproducing characteristic phenomena like phase transitions in training loss for induction head formation and power-law scaling of in-context loss. We further show that a model's in-context performance on any task is mathematically coupled to the ensemble of tasks seen in pretraining, offering a fundamental explanation, grounded in architecture- and modality-independent principles, for such inference-time learning.
摘要
我们提出,上下文学习(ICL)可预测地源自标准的自监督下一词元预训练,而非一种特殊的涌现属性。本研究通过聚焦同分布ICL现象,阐明了这种涌现的基本原理:当模型在非遍历性数据源的词元序列上训练时,必然发展出适应上下文的能力。我们的信息论框架精确预测了这些同分布ICL动态(即上下文依赖的损失降低)。通过在不同相关结构的合成数据集上进行实验,我们验证了该框架的有效性,复现了训练损失中的特征现象——如归纳头形成时的相变和上下文损失的幂律缩放。进一步研究表明,模型在任何任务上的上下文表现都与预训练中接触的任务集合存在数学耦合,这为这种推理时学习提供了基于架构与模态无关原理的根本性解释。
LatentLLM: Attention-Aware Joint Tensor Compression
Abstract
arXiv:2505.18413v1 Announce Type: cross Abstract: Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can significantly improve the model accuracy over the existing model compression methods when reducing the latent dimension to realize computationally/memory-efficient LLMs/LMMs. We show the benefit on several benchmarks, including multi-modal reasoning tasks.
摘要
现代基础模型(如大语言模型LLMs和大规模多模态模型LMMs)需要消耗巨大的计算和内存资源。我们提出了一种新框架,可将此类LLMs/LMMs转换为降维潜在结构。该方法将局部激活感知的张量分解扩展为全局注意力感知的联合张量分解。当降低潜在维度以实现计算/内存高效的LLMs/LMMs时,我们的框架能显著提升现有模型压缩方法的精度。我们在包括多模态推理任务在内的多个基准测试中验证了该方法的优势。
Thought calibration: Efficient and confident test-time scaling
Abstract
arXiv:2505.18404v1 Announce Type: cross Abstract: Reasoning large language models achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. Directly limiting test-time budget hurts overall performance, but not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a language model's growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the language model's hidden representations, which are informative of both the reasoning structure and overall consistency of response. Based on three reasoning language models and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to 20% in out-of-distribution data.
摘要
大型语言模型通过延长推理时间实现了显著的测试时性能提升,但这种性能增益伴随着高昂的计算成本。直接限制测试时预算会损害整体性能,但并非所有问题都具有同等难度。我们提出思维校准方法,用于动态决定何时终止推理过程。为校准决策规则,我们将语言模型不断增长的思维体视为嵌套的推理树序列,其目标是识别新推理达到平台期的临界点。该框架通过轻量级探针实现,这些探针作用于语言模型的隐藏表示层,既能捕捉推理结构信息,又能评估响应整体一致性。基于三个推理语言模型和四个数据集的实验表明,思维校准在分布内数据上可减少高达60%的推理标记消耗,在分布外数据上减少达20%,同时保持模型性能。
μ-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts
Abstract
arXiv:2505.18451v1 Announce Type: cross Abstract: To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these rely on calibration data, domain shift may arise for unknown downstream tasks. With a computationally efficient calibration, activation-aware pruning can be executed adaptively for every prompt while still achieving reduced complexity at inference. We formulate it as a mixture of micro-experts, called μ-MoE. Several experiments demonstrate that μ-MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.
摘要
为应对大型基础模型巨大的计算需求,无需重新训练的激活感知压缩技术应运而生。然而,由于这些技术依赖校准数据,在未知下游任务中可能出现域偏移问题。通过计算高效的校准过程,我们实现了针对每个提示的自适应激活感知剪枝,同时降低了推理复杂度。我们将其建模为一种微型专家混合系统(μ-MoE)。多项实验表明,μ-MoE能够实时动态适应任务/提示相关的结构化稀疏性。
Retrieval Augmented Generation-based Large Language Models for Bridging Transportation Cybersecurity Legal Knowledge Gaps
Abstract
arXiv:2505.18426v1 Announce Type: cross Abstract: As connected and automated transportation systems evolve, there is a growing need for federal and state authorities to revise existing laws and develop new statutes to address emerging cybersecurity and data privacy challenges. This study introduces a Retrieval-Augmented Generation (RAG) based Large Language Model (LLM) framework designed to support policymakers by extracting relevant legal content and generating accurate, inquiry-specific responses. The framework focuses on reducing hallucinations in LLMs by using a curated set of domain-specific questions to guide response generation. By incorporating retrieval mechanisms, the system enhances the factual grounding and specificity of its outputs. Our analysis shows that the proposed RAG-based LLM outperforms leading commercial LLMs across four evaluation metrics: AlignScore, ParaScore, BERTScore, and ROUGE, demonstrating its effectiveness in producing reliable and context-aware legal insights. This approach offers a scalable, AI-driven method for legislative analysis, supporting efforts to update legal frameworks in line with advancements in transportation technologies.
摘要
随着互联与自动化交通系统的发展,联邦和州级监管机构亟需修订现有法律并制定新法规以应对新兴的网络安全和数据隐私挑战。本研究提出了一种基于检索增强生成(RAG)的大语言模型(LLM)框架,旨在通过提取相关法律内容并生成精准的查询响应来支持政策制定者。该框架通过使用特定领域问题集指导响应生成,有效减少大语言模型的幻觉现象。通过整合检索机制,系统显著增强了输出结果的事实依据与针对性。分析表明,基于RAG的大语言模型在AlignScore、ParaScore、BERTScore和ROUGE四项评估指标上均优于主流商用大语言模型,证实其在生成可靠且情境感知的法律见解方面的有效性。该方法为立法分析提供了可扩展的人工智能驱动解决方案,支持交通技术发展背景下的法律框架更新工作。
TNG-CLIP:Training-Time Negation Data Generation for Negation Awareness of CLIP
Abstract
arXiv:2505.18434v1 Announce Type: cross Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP is still limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods address the problem by using a large language model (LLM) to generate large-scale data of image captions containing negation for further fine-tuning CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. To expand the horizon, we (1) introduce a training-time negation data generation pipeline in which negation captions are generated during the training stage, adding only 2.5% to the training time, and (2) propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation, assessing a model's ability to produce semantically accurate images. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks of image-to-text matching, text-to-image retrieval, and image generation.
摘要
视觉语言模型(VLM,如CLIP)在一系列下游任务中展现出强劲性能。然而,CLIP在否定理解(即识别概念缺失或排除的能力)方面仍存在局限。现有方法通过使用大语言模型(LLM)生成包含否定的大规模图像描述数据以微调CLIP,但这类方法耗时且计算密集,且评估通常仅限于图文匹配任务。为拓展研究边界,我们(1)提出一种训练时否定数据生成流程,使否定描述在训练阶段动态生成,仅增加2.5%的额外训练时间;(2)首次建立Neg-TtoI基准,用于评估文本到图像生成模型处理含否定提示时的语义准确性。实验表明,我们提出的TNG-CLIP方法在图文匹配、文本到图像检索及图像生成等多类否定基准测试中均达到最先进性能。
Efficient Long CoT Reasoning in Small Language Models
Abstract
arXiv:2505.18440v1 Announce Type: cross Abstract: Recent large reasoning models such as DeepSeek-R1 exhibit strong complex problem-solving abilities by generating long chain-of-thought (CoT) reasoning steps. It is challenging to directly train small language models (SLMs) to produce long CoT. Thus, distillation becomes a practical method to give SLMs this reasoning ability. However, long CoT often contains a lot of redundant content (e.g., overthinking steps), which may make it hard for SLMs to learn given their relatively limited capacity and generalization. To address this issue, we propose a simple yet effective method to prune unnecessary steps in long CoT, and then employ an on-policy method for the SLM itself to curate valid and useful long CoT training data. In this way, SLMs can effectively learn efficient long CoT reasoning and preserve competitive performance at the same time. Experimental results across a series of mathematical reasoning benchmarks demonstrate the effectiveness of the proposed method in distilling long CoT reasoning ability into SLMs, maintaining competitive performance while significantly reducing the generation of redundant reasoning steps.
摘要
近期诸如DeepSeek-R1等大型推理模型通过生成长链思维(CoT)推理步骤展现出强大的复杂问题解决能力。直接训练小型语言模型(SLMs)实现长链思维涌现具有挑战性,因此蒸馏成为赋予SLMs此类推理能力的实用方法。然而,长链思维常包含大量冗余内容(如过度思考步骤),考虑到SLMs相对有限的容量和泛化能力,这可能使其难以有效学习。针对该问题,我们提出一种简单而有效的方法来修剪长链思维中不必要的步骤,并采用策略内方法让SLM自身筛选有效且有用的长链思维训练数据。通过这种方式,SLMs既能高效学习长链思维推理,又能保持竞争优势。在一系列数学推理基准测试中的实验结果表明,所提方法能有效将长链思维推理能力蒸馏至SLMs,在保持性能竞争力的同时显著减少冗余推理步骤的生成。
Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications
Abstract
arXiv:2505.18488v1 Announce Type: cross Abstract: Error correction is an important capability when applying large language models (LLMs) to facilitate user typing on mobile devices. In this paper, we use LLMs to synthesize a high-quality dataset of error correction pairs to evaluate and improve LLMs for mobile applications. We first prompt LLMs with error correction domain knowledge to build a scalable and reliable addition to the existing data synthesis pipeline. We then adapt the synthetic data distribution to match the mobile application domain by reweighting the samples. The reweighting model is learnt by predicting (a handful of) live A/B test metrics when deploying LLMs in production, given the LLM performance on offline evaluation data and scores from a small privacy-preserving on-device language model. Finally, we present best practices for mixing our synthetic data with other data sources to improve model performance on error correction in both offline evaluation and production live A/B testing.
摘要
在将大语言模型(LLMs)应用于移动设备用户输入辅助时,纠错能力至关重要。本文利用LLMs合成高质量纠错配对数据集,以评估并优化移动应用场景下的语言模型性能。我们首先基于纠错领域知识设计提示方案,构建可扩展且可靠的数据合成流程扩展。随后通过样本重加权方法,使合成数据分布适配移动应用领域特性。该重加权模型通过预测生产环境中的少量A/B测试指标进行训练,其输入包括LLMs在离线评估数据上的表现以及小型隐私保护设备端语言模型的评分结果。最后,我们提出混合使用合成数据与其他数据源的最佳实践方案,以提升模型在离线评估和生产环境A/B测试中的纠错性能。
Using Large Language Models to Tackle Fundamental Challenges in Graph Learning: A Comprehensive Survey
Abstract
arXiv:2505.18475v1 Announce Type: cross Abstract: Graphs are a widely used paradigm for representing non-Euclidean data, with applications ranging from social network analysis to biomolecular prediction. Conventional graph learning approaches typically rely on fixed structural assumptions or fully observed data, limiting their effectiveness in more complex, noisy, or evolving settings. Real-world graph data often violates these assumptions, which leads to four fundamental challenges: (1) Incompleteness: real-world graphs have missing nodes, edges, or attributes; (2) Imbalance: the distributions of node or edge labels and of graph structures are highly skewed; (3) Cross-domain Heterogeneity: graphs from different domains exhibit incompatible feature spaces or structural patterns; and (4) Dynamic Instability: graphs evolve over time in unpredictable ways. Recent advances in Large Language Models (LLMs) offer the potential to tackle these challenges by leveraging rich semantic reasoning and external knowledge. This survey provides a comprehensive review of how LLMs can be integrated with graph learning to address the aforementioned challenges. For each challenge, we review both traditional solutions and modern LLM-driven approaches, highlighting how LLMs contribute unique advantages. Finally, we discuss open research questions and promising future directions in this emerging interdisciplinary field. To support further exploration, we have curated a repository of recent advances on graph learning challenges: https://github.com/limengran98/Awesome-Literature-Graph-Learning-Challenges.
From Reddit to Generative AI: Evaluating Large Language Models for Anxiety Support Fine-tuned on Social Media Data
Abstract
arXiv:2505.18464v1 Announce Type: cross Abstract: The growing demand for accessible mental health support, compounded by workforce shortages and logistical barriers, has led to increased interest in utilizing Large Language Models (LLMs) for scalable and real-time assistance. However, their use in sensitive domains such as anxiety support remains underexamined. This study presents a systematic evaluation of LLMs (GPT and Llama) for their potential utility in anxiety support by using real user-generated posts from the r/Anxiety subreddit for both prompting and fine-tuning. Our approach utilizes a mixed-method evaluation framework incorporating three main categories of criteria: (i) linguistic quality, (ii) safety and trustworthiness, and (iii) supportiveness. Results show that fine-tuning LLMs with naturalistic anxiety-related data enhanced linguistic quality but increased toxicity and bias, and diminished emotional responsiveness. While LLMs exhibited limited empathy, GPT was evaluated as more supportive overall. Our findings highlight the risks of fine-tuning LLMs on unprocessed social media content without mitigation strategies.
摘要
随着对便捷心理健康服务需求的日益增长,加之专业人才短缺和地理障碍等因素,利用大语言模型(LLMs)提供可扩展的实时辅助服务受到广泛关注。然而,这类模型在焦虑支持等敏感领域的应用仍缺乏系统研究。本研究通过使用r/Anxiety版块真实用户发帖作为提示词和微调数据,对GPT和Llama等大语言模型在焦虑支持中的潜在效用进行系统评估。我们采用混合方法评估框架,包含三个主要标准类别:(1)语言质量;(2)安全性与可信度;(3)支持性。结果表明,基于自然焦虑数据微调的模型虽提升了语言质量,但毒性偏见增加、情感响应性降低。大语言模型整体表现出有限共情能力,其中GPT被评估为更具支持性。本研究揭示了在缺乏缓解策略情况下,直接使用未经处理的社交媒体内容微调大语言模型的风险。
Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services
Abstract
arXiv:2505.18471v1 Announce Type: cross Abstract: Modern large language model (LLM) services increasingly rely on complex, often abstract operations, such as multi-step reasoning and multi-agent collaboration, to generate high-quality outputs. While users are billed based on token consumption and API usage, these internal steps are typically not visible. We refer to such systems as Commercial Opaque LLM Services (COLS). This position paper highlights emerging accountability challenges in COLS: users are billed for operations they cannot observe, verify, or contest. We formalize two key risks: quantity inflation, where token and call counts may be artificially inflated, and quality downgrade, where providers might quietly substitute lower-cost models or tools. Addressing these risks requires a diverse set of auditing strategies, including commitment-based, predictive, behavioral, and signature-based methods. We further explore the potential of complementary mechanisms such as watermarking and trusted execution environments to enhance verifiability without compromising provider confidentiality. We also propose a modular three-layer auditing framework for COLS and users that enables trustworthy verification across execution, secure logging, and user-facing auditability without exposing proprietary internals. Our aim is to encourage further research and policy development toward transparency, auditability, and accountability in commercial LLM services.
摘要
现代大型语言模型(LLM)服务日益依赖复杂且通常抽象的操作(如多步推理与多智能体协作)来生成高质量输出。尽管用户计费基于令牌消耗和API使用量,但这些内部步骤通常不可见。我们将此类系统称为商业不透明LLM服务(COLS)。本立场文件揭示了COLS中新兴的问责挑战:用户为无法观察、验证或质疑的操作付费。我们形式化了两大风险:\textit{数量膨胀}(令牌和调用计数可能被人为夸大)与\textit{质量降级}(提供商可能悄然替换低成本模型或工具)。应对这些风险需要多样化的审计策略,包括基于承诺、预测、行为及签名的方法。我们进一步探讨了水印与可信执行环境等补充机制在提升可验证性同时不损害提供商机密性的潜力。此外,我们提出了面向COLS与用户的模块化三层审计框架,该框架支持跨执行、安全日志记录和用户可审计性的可信验证,且无需暴露专有内部信息。本研究旨在推动商业LLM服务在透明度、可审计性与问责制方面的进一步研究与政策制定。
AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking
Abstract
arXiv:2505.18512v1 Announce Type: cross Abstract: Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Due to the limit in context size and high inference cost of long context, reranking is typically performed over a fixed size of small subsets, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, leading to inefficiencies. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Using a Bayesian TrueSkill model, we iteratively refine relevance estimates until reaching sufficient confidence levels, and our explicit modeling of ranking uncertainty enables principled control over reranking behavior and avoids unnecessary updates to confident predictions. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. These results highlight the effectiveness and generalizability of our method across diverse retrieval tasks and LLM-based reranking models.
摘要
基于大语言模型(LLMs)的列表式重排序能够提升检索应用中排名靠前的结果质量。由于上下文长度限制及长上下文推理成本较高,重排序通常仅针对固定数量的小规模候选子集进行,最终排序结果由这些局部结果聚合而成。这种固定计算模式忽视了查询难度与文档分布特性,导致效率低下。我们提出AcuRank——一种自适应重排序框架,通过基于文档相关性不确定性估计的动态机制,自适应调整计算量与计算目标。该方法采用贝叶斯TrueSkill模型迭代优化相关性估计直至达到足够置信度,其显式的排序不确定性建模实现了对重排序行为的可控调节,避免对高置信度预测进行不必要的更新。在TREC-DL和BEIR基准测试上的实验表明,本方法始终能实现更优的准确率-效率权衡,且计算扩展性优于固定计算基线。这些结果验证了我们的方法在不同检索任务和基于LLM的重排序模型中具有显著的有效性与泛化能力。
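The uncertainty-driven selection described in the AcuRank abstract can be illustrated with a small sketch. This is not the authors' implementation and it does not perform TrueSkill updates: it only assumes each document carries a Gaussian relevance belief (mean, standard deviation) and picks the documents whose top-k membership is still uncertain, which are the ones worth sending to another reranking pass. The function names, the midpoint threshold, and the 0.1/0.9 confidence bounds are all hypothetical choices made for illustration.

```python
import math


def prob_above(mean, std, threshold):
    """P(score > threshold) under a Gaussian belief N(mean, std^2)."""
    z = (threshold - mean) / (std * math.sqrt(2.0))
    return 0.5 * (1.0 - math.erf(z))


def uncertain_documents(beliefs, k, low=0.1, high=0.9):
    """Return ids of documents whose top-k membership is still uncertain.

    beliefs: dict mapping doc id -> (mean, std); must hold more than k docs.
    low/high: hypothetical confidence bounds, not taken from the paper.
    """
    ranked = sorted(beliefs, key=lambda d: beliefs[d][0], reverse=True)
    # Assumed cut-off: midpoint between the k-th and (k+1)-th mean scores.
    threshold = (beliefs[ranked[k - 1]][0] + beliefs[ranked[k]][0]) / 2.0
    return [
        d for d in ranked
        if low < prob_above(beliefs[d][0], beliefs[d][1], threshold) < high
    ]


beliefs = {
    "a": (3.0, 0.1),   # confidently relevant
    "b": (1.1, 1.0),   # near the cut-off, high variance
    "c": (0.9, 1.0),   # near the cut-off, high variance
    "d": (-2.0, 0.1),  # confidently irrelevant
}
to_rerank = uncertain_documents(beliefs, k=2)
```

In a full loop, only the documents in `to_rerank` would be passed to the LLM reranker again, their beliefs updated, and the selection repeated until the list is empty or a compute budget is exhausted.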
From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test
Abstract
arXiv:2505.18562v1 Announce Type: cross Abstract: The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through lexical-semantic patterns. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To mitigate cultural preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism to guide semantic representations toward culturally specific spaces. Experiments show that current LLMs exhibit significant bias toward Western (notably American) cultural schemas at the word association level. In contrast, our model substantially improves cross-cultural alignment, surpassing prompt-based methods in capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.
摘要
以人为中心的词汇联想测试(WAT)作为认知代理,通过词汇语义模式揭示社会文化差异。本研究将该测试扩展为适应大语言模型(LLM)的自由联想任务,用于评估大语言模型与跨文化认知的契合度。为消除文化偏好,我们提出CultureSteer创新方法,通过集成文化感知引导机制,将语义表征导向特定文化空间。实验表明,当前大语言模型在词汇联想层面显著偏向西方文化(尤其是美国)图式;相较之下,我们的模型显著提升了跨文化契合度,在捕捉多样化语义关联方面超越基于提示词的方法。在文化敏感性下游任务中的进一步验证证实了该方法在促进跨文化认知对齐方面的有效性。本研究为增强大语言模型的文化意识提供了新颖的方法论范式,推动了更具包容性语言技术的发展。
G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning
Abstract
arXiv:2505.18499v1 Announce Type: cross Abstract: Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs' graph reasoning abilities. To enable RL training, we curate Erdős, the largest graph reasoning dataset to date, comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training samples, and 5k test samples, all derived from real-world graphs. With RL on Erdős, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully.
摘要
尽管大型语言模型(LLMs)已展现出显著进展,但其在图相关任务中的表现仍存在明显局限,这阻碍了通用模型的真正发展。先前尝试(包括预训练图基础模型或采用监督微调)常面临大规模通用图数据稀缺等挑战。我们提出G1——一种简单而有效的方法,证明在合成图论任务上通过强化学习(RL)可显著扩展LLMs的图推理能力。为支持RL训练,我们构建了迄今最大规模的图推理数据集Erd~os,包含50种不同难度的多样化图论任务、10万训练数据和5千测试数据,所有数据均源自真实世界图结构。通过在Erd~os上进行RL训练,G1实现了图推理能力的显著提升:经微调的30亿参数模型甚至超越Qwen2.5-72B-Instruct(规模为其24倍)。RL训练模型还展现出对未见任务、领域及图编码方案的强大零样本泛化能力,包括其他图论基准测试以及真实世界的节点分类和链接预测任务,且不影响通用推理能力。我们的研究为构建强图推理器提供了一条高效、可扩展的路径:通过在图论任务上对LLMs进行RL微调,将预训练LLM能力与自动生成的丰富合成数据优势相结合,这表明LLMs具备可通过RL成功激发的图理解能力。
FedHL: Federated Learning for Heterogeneous Low-Rank Adaptation via Unbiased Aggregation
Abstract
arXiv:2505.18494v1 Announce Type: cross Abstract: Federated Learning (FL) facilitates the fine-tuning of Foundation Models (FMs) using distributed data sources, with Low-Rank Adaptation (LoRA) gaining popularity due to its low communication costs and strong performance. While recent work acknowledges the benefits of heterogeneous LoRA in FL and introduces flexible algorithms to support its implementation, our theoretical analysis reveals a critical gap: existing methods lack formal convergence guarantees due to parameter truncation and biased gradient updates. Specifically, adapting client-specific LoRA ranks necessitates truncating global parameters, which introduces inherent truncation errors and leads to subsequent inaccurate gradient updates that accumulate over training rounds, ultimately degrading performance. To address the above issues, we propose FedHL, a simple yet effective Federated Learning framework tailored for Heterogeneous LoRA. By leveraging the full-rank global model as a calibrated aggregation basis, FedHL eliminates the direct truncation bias from initial alignment with client-specific ranks. Furthermore, we derive the theoretically optimal aggregation weights by minimizing the gradient drift term in the convergence upper bound. Our analysis shows that FedHL guarantees an $\mathcal{O}(1/\sqrt{T})$ convergence rate, and experiments on multiple real-world datasets demonstrate a 1-3% improvement over several state-of-the-art methods.
摘要
联邦学习(FL)支持利用分布式数据源对基础模型(FM)进行微调,其中低秩自适应(LoRA)因其低通信成本和优异性能而广受关注。尽管近期研究认识到异构LoRA在FL中的优势,并提出了灵活算法支持其实现,但我们的理论分析揭示了一个关键缺陷:现有方法由于参数截断和梯度更新偏差而缺乏形式化收敛保证。具体而言,为适应客户端特定的LoRA秩,需对全局参数进行截断,这会引入固有截断误差,并导致后续梯度更新不准确,这些误差在训练轮次中不断累积,最终降低模型性能。为解决上述问题,我们提出\textbf{FedHL}——一个简单而有效的、专为\textbf{异构}\textbf{LoRA}设计的\textbf{联邦学习}框架。该方法通过将全秩全局模型作为校准聚合基准,消除了与客户端特定秩初始对齐时的直接截断偏差。此外,我们通过最小化收敛上界中的梯度漂移项,推导出理论最优聚合权重。分析表明FedHL可保证\mathcal{O}(1/\sqrt{T})的收敛速率,在多个真实数据集上的实验显示其性能较现有最优方法提升1-3%。
CLaDMoP: Learning Transferrable Models from Successful Clinical Trials via LLMs
Abstract
arXiv:2505.18527v1 Announce Type: cross Abstract: Many existing models for clinical trial outcome prediction are optimized using task-specific loss functions on trial phase-specific data. While this scheme may boost prediction for common diseases and drugs, it can hinder learning of generalizable representations, leading to more false positives/negatives. To address this limitation, we introduce CLaDMoP, a new pre-training approach for clinical trial outcome prediction, alongside the Successful Clinical Trials (SCT) dataset, specifically designed for this task. CLaDMoP leverages a Large Language Model to encode trials' eligibility criteria, linked to a lightweight Drug-Molecule branch through a novel multi-level fusion technique. To efficiently fuse long embeddings across levels, we incorporate a grouping block, drastically reducing computational overhead. CLaDMoP avoids reliance on task-specific objectives by pre-training on a "pair matching" proxy task. Compared to established zero-shot and few-shot baselines, our method significantly improves both PR-AUC and ROC-AUC, especially for phase I and phase II trials. We further evaluate and perform ablation on CLaDMoP after Parameter-Efficient Fine-Tuning, comparing it to state-of-the-art supervised baselines, including MEXA-CTP, on the Trial Outcome Prediction (TOP) benchmark. CLaDMoP achieves up to 10.5% improvement in PR-AUC and 3.6% in ROC-AUC, while attaining an F1 score comparable to MEXA-CTP, highlighting its potential for clinical trial outcome prediction. Code and SCT dataset can be downloaded from https://github.com/murai-lab/CLaDMoP.
摘要
现有许多临床试验结果预测模型通过在特定试验阶段数据上使用任务专用损失函数进行优化。尽管这种方案可能提升常见疾病和药物的预测效果,但会阻碍可泛化表征的学习,导致更多假阳性/假阴性结果。为克服这一局限,我们提出CLaDMoP——一种新的临床试验结果预训练方法,并为此专门构建了成功临床试验数据集(SCT)。CLaDMoP利用大型语言模型编码试验的入选标准,通过新型多层次融合技术将其与轻量级药物分子分支相连接。为高效融合跨层次的长嵌入向量,我们引入分组模块,显著降低计算开销。该方法通过"配对匹配"代理任务进行预训练,避免依赖任务特定目标。相较于成熟的零样本和小样本基线,我们的方法在PR-AUC和ROC-AUC指标上均取得显著提升,尤其在一期和二期临床试验中表现突出。我们在参数高效微调后对CLaDMoP进行评估和消融实验,与包括MEXA-CTP在内的监督学习基线在试验结果预测(TOP)基准上进行比较。CLaDMoP实现PR-AUC最高提升10.5%、ROC-AUC提升3.6%,同时获得与MEXA-CTP相当的F1分数,展现了其在临床试验结果预测中的应用潜力。代码和SCT数据集可从https://github.com/murai-lab/CLaDMoP下载。
Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models
Abstract
arXiv:2505.18536v1 Announce Type: cross Abstract: Standing in 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential in enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. To begin with, we provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. Furthermore, we meticulously summarize the improvements of RFT in powering reasoning capability of MLLMs into five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. Summary of works done on RFT for MLLMs is available at https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs.
摘要
站在2025年这一追求通用人工智能(AGI)的关键节点,强化微调(RFT)技术在提升大语言模型(LLMs)推理能力方面已展现出显著潜力,并催生了OpenAI-o1与DeepSeek-R1等尖端AI模型。更值得注意的是,RFT在增强多模态大语言模型(MLLMs)推理能力方面的有效应用已引发学界广泛关注。本立场文件论证了强化微调技术对多模态大语言模型推理能力的赋能作用。首先,我们系统介绍了该领域研究者应掌握的基础背景知识;进而将RFT提升MLLMs推理能力的进展精炼为五大要点:多模态支持、多任务与多领域适应、优化训练算法、丰富基准测试体系及蓬勃发展的工程框架;最后提出了五个值得学界探索的未来研究方向。我们期待这份立场文件能为AGI发展关键阶段的学术共同体提供有价值的洞见。RFT应用于MLLMs的研究成果汇总详见https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs。
Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
Abstract
arXiv:2505.18556v1 Announce Type: cross Abstract: Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our "FSTR+SPIN" variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs' safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.
摘要
意图检测作为自然语言理解的核心组件,已发展成为保护大语言模型(LLMs)安全的关键机制。尽管先前研究已应用意图检测来强化LLMs的内容审核护栏,并在防御内容层面越狱攻击方面取得显著成效,但这些意图感知护栏在恶意操纵下的鲁棒性仍未得到充分探索。本研究揭示了意图感知护栏的脆弱性,并证明LLMs具有隐式意图检测能力。我们提出了一种两阶段基于意图的提示优化框架IntentPrompt:首先将有害查询转化为结构化纲要,继而通过反馈循环迭代优化提示,将其重构为陈述式叙述以提升红队测试的越狱成功率。在四个公开基准测试和多种黑盒LLMs上的大量实验表明,本框架持续优于多种前沿越狱方法,并能规避包括意图分析(IA)和思维链(CoT)在内的先进防御机制。具体而言,我们的"FSTR+SPIN"变体在o1模型上对CoT防御的攻击成功率达88.25%至96.54%,在GPT-4o模型上对IA防御的攻击成功率达86.75%至97.12%。这些发现揭示了LLMs安全机制的关键弱点,表明意图操纵对内容审核护栏构成日益严峻的挑战。
Removal of Hallucination on Hallucination: Debate-Augmented RAG
Abstract
arXiv:2505.18581v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate-Augmented RAG (DRAG), a training-free framework that integrates Multi-Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG-induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at https://github.com/Huenao/Debate-Augmented-RAG.
摘要
检索增强生成(RAG)通过整合外部知识提升事实准确性,但引入了一个关键问题:错误或有偏见的检索可能误导生成过程,加剧幻觉现象,我们称之为"幻觉叠加"。为解决这一问题,我们提出辩论增强RAG(DRAG),这是一种无需训练的框架,将多智能体辩论(MAD)机制整合到检索和生成阶段。在检索阶段,DRAG采用支持者、反对者和裁判的结构化辩论机制,优化检索质量并确保事实可靠性。在生成阶段,DRAG引入非对称信息角色和对抗性辩论,增强推理鲁棒性并减少事实不一致性。多任务评估表明,DRAG能提高检索可靠性,减少RAG引发的幻觉,并显著提升整体事实准确性。我们的代码发布于https://github.com/Huenao/Debate-Augmented-RAG。
Safety Alignment via Constrained Knowledge Unlearning
Abstract
arXiv:2505.18588v1 Announce Type: cross Abstract: Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance, offering a superior balance between safety and utility compared to existing methods. Additionally, our analysis of neuron knowledge sensitivity across various MLP layers provides valuable insights into the mechanics of safety alignment and model knowledge editing.
摘要
尽管在安全对齐方面取得了显著进展,大型语言模型(LLM)仍易受越狱攻击影响。现有防御机制未能完全消除模型中的有害知识,导致攻击者可绕过防护措施生成有害输出。针对这一挑战,我们提出了一种新颖的安全对齐策略——约束性知识遗忘(CKU),该策略聚焦两大目标:知识定位与保留,以及有害知识遗忘。CKU通过为特定多层感知机(MLP)层中的神经元评分,识别出与有用知识相关的神经元子集U。在遗忘过程中,CKU对U中神经元的梯度进行剪枝,在有效消除有害内容的同时保留有价值的知识。实验结果表明,CKU能在不影响模型整体性能的前提下显著提升安全性,相比现有方法实现了安全性与实用性的更优平衡。此外,我们对不同MLP层神经元知识敏感度的分析,为安全对齐和模型知识编辑的机制提供了重要见解。
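The gradient-pruning idea in the CKU abstract (score neurons, pick a subset U tied to useful knowledge, then block updates to U during unlearning) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's method: the neuron scores, the top-k selection rule, the plain gradient-descent update, and both function names are hypothetical, and each "neuron" is modelled as one row of a weight matrix.

```python
def select_protected(scores, k):
    """Pick the indices of the k highest-scoring neurons as the subset U.

    scores: one hypothetical 'useful knowledge' score per neuron.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def masked_unlearning_step(weights, grads, protected, lr=0.1):
    """Apply one descent step on an (assumed) unlearning loss.

    Rows whose index is in `protected` have their gradient pruned,
    so those neurons keep their weights unchanged.
    """
    updated = []
    for i, (w_row, g_row) in enumerate(zip(weights, grads)):
        if i in protected:
            updated.append(list(w_row))  # gradient pruned: no change
        else:
            updated.append([w - lr * g for w, g in zip(w_row, g_row)])
    return updated


weights = [[0.0, 0.0, 0.0] for _ in range(4)]
grads = [[1.0, 1.0, 1.0] for _ in range(4)]
scores = [0.2, 0.9, 0.1, 0.8]              # hypothetical neuron scores
protected = select_protected(scores, k=2)  # neurons 1 and 3
updated = masked_unlearning_step(weights, grads, protected)
```

Only neurons 0 and 2 move in this example; the two highest-scoring neurons are left untouched, which is the retention half of the objective the abstract describes.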
MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations
Abstract
arXiv:2505.18595v1 Announce Type: cross Abstract: We study offline imitation learning (IL) in cooperative multi-agent settings, where demonstrations have unlabeled mixed quality - containing both expert and suboptimal trajectories. Our proposed solution is structured in two stages: trajectory labeling and multi-agent imitation learning, designed jointly to enable effective learning from heterogeneous, unlabeled data. In the first stage, we combine advances in large language models and preference-based reinforcement learning to construct a progressive labeling pipeline that distinguishes expert-quality trajectories. In the second stage, we introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies while addressing the computational complexity of large joint state-action spaces. By extending the popular single-agent DICE framework to multi-agent settings with a new value decomposition and mixing architecture, our method yields a convex policy optimization objective and ensures consistency between global and local policies. We evaluate MisoDICE on multiple standard multi-agent RL benchmarks and demonstrate superior performance, especially when expert data is scarce.
摘要
我们研究合作多智能体环境下的离线模仿学习(IL),其中演示数据包含未标注的混合质量轨迹——既有专家级也有次优轨迹。提出的解决方案采用两阶段结构:轨迹标注和多智能体模仿学习,通过联合设计实现从异构未标注数据中有效学习。第一阶段结合大型语言模型和基于偏好的强化学习技术,构建渐进式标注流程以识别专家级轨迹。第二阶段提出MisoDICE算法,这是一种新型多智能体IL方法,利用标注信息学习鲁棒策略,同时解决大规模联合状态-动作空间的计算复杂度问题。通过将流行的单智能体DICE框架扩展至多智能体场景,并采用新的价值分解与混合架构,我们的方法产生了凸策略优化目标,确保全局与局部策略的一致性。在多个标准多智能体强化学习基准测试中评估MisoDICE,结果表明其性能优越,尤其在专家数据稀缺时表现突出。
Autocomp: LLM-Driven Code Optimization for Tensor Accelerators
Abstract
arXiv:2505.18574v1 Announce Type: cross Abstract: Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages like specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three categories of representative workloads and two different accelerators, we demonstrate that Autocomp-optimized code runs 5.6x (GEMM) and 2.7x (convolution) faster than the vendor-provided library, and outperforms expert-level hand-tuned code by 1.4x (GEMM), 1.1x (convolution), and 1.3x (fine-grained linear algebra). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.
摘要
硬件加速器,尤其是专为张量处理设计的加速器,已在当今计算领域无处不在。然而,尽管在编译器构建方面投入了大量努力,对这些张量加速器进行编程仍然具有挑战性,导致其潜力远未得到充分利用。近期,基于海量代码训练的大型语言模型(LLMs)在代码生成与优化任务中展现出显著潜力,但生成专用张量加速器代码等低资源语言仍面临重大挑战。我们提出Autocomp方法应对这一挑战,该方法使加速器程序员能够利用领域知识和硬件反馈,通过自动化LLM驱动搜索优化代码。具体实现包括:1)将每个优化过程构建为结构化的两阶段提示(规划阶段与代码生成阶段),2)在规划阶段通过简洁可适配的优化菜单注入领域知识,3)在每次搜索迭代中整合来自硬件的正确性指标与性能指标作为反馈。在三大类典型工作负载和两种不同加速器上的实验表明,经Autocomp优化的代码运行速度比厂商提供的库快5.6倍(GEMM)和2.7倍(卷积),并分别以1.4倍(GEMM)、1.1倍(卷积)和1.3倍(细粒度线性代数)的优势超越专家级手工调优代码。此外,我们证明Autocomp生成的优化方案可在相似张量运算中复用,在固定样本预算下将加速效果提升最高达24%。
Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models
Abstract
arXiv:2505.18596v1 Announce Type: cross Abstract: The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process: Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two fake news datasets demonstrate significant improvements over baseline methods, and the case study highlights D2D's capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards robust and interpretable misinformation detection. The code will be open-sourced in a future release.
摘要
数字平台中虚假信息的泛滥暴露了传统检测方法的局限性,这些方法主要依赖静态分类,无法捕捉现实世界事实核查的复杂过程。尽管大型语言模型(LLMs)的进步增强了自动推理能力,但其在虚假信息检测中的应用仍受困于逻辑不一致性和表面化验证等问题。为此,我们提出"辩论式检测"(Debate-to-Detect,D2D)——一种新颖的多智能体辩论框架,将虚假信息检测重构为结构化对抗辩论。受事实核查工作流程启发,D2D为每个智能体分配特定领域角色,并设计五阶段辩论流程:开场陈述、反驳、自由辩论、结辩陈述和裁决。为超越传统二元分类,D2D引入多维评估机制,从五个维度评估每项主张:事实性、信源可靠性、推理质量、清晰度和伦理合规性。基于GPT-4o在两个假新闻数据集上的实验表明,该方法较基线有显著提升,案例研究凸显D2D能迭代优化证据并提高决策透明度,标志着向稳健可解释的虚假信息检测迈出重要一步。代码将于后续版本开源。
Rethinking Causal Mask Attention for Vision-Language Inference
Abstract
arXiv:2505.18605v1 Announce Type: cross Abstract: Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model's capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies. We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits the inference.
摘要
因果注意力已成为自回归视觉语言模型(VLM)的基础机制,将文本与视觉输入统一在单一生成框架下。然而,现有基于因果掩码的策略继承自纯文本解码的大语言模型(LLM),其在预填充阶段对视觉标记的适应性处理不足。对视觉查询严格屏蔽未来位置会引入过度刚性约束,阻碍模型利用常含关键语义线索的未来上下文进行准确推理。本研究通过实证探讨不同因果掩码策略如何影响视觉语言推理,进而提出适用于该场景的未来感知注意力机制家族。我们首先实证分析了视觉查询中预览未来标记的效果,证明刚性掩码会削弱模型捕获有用上下文语义表征的能力。基于这些发现,我们提出一种轻量级注意力家族,通过池化将未来视觉上下文聚合到历史表征中,在保持自回归结构的同时增强跨标记依赖性。我们在多样化视觉语言推理场景中评估了多种因果掩码,结果表明有选择地将未来语义上下文压缩至历史表征有利于提升推理性能。
LLM-Meta-SR: Learning to Evolve Selection Operators for Symbolic Regression
Abstract
arXiv:2505.18602v1 Announce Type: cross Abstract: Large language models (LLMs) have revolutionized algorithm development, yet their application in symbolic regression, where algorithms automatically discover symbolic expressions from data, remains constrained and is typically designed manually by human experts. In this paper, we propose a learning-to-evolve framework that enables LLMs to automatically design selection operators for evolutionary symbolic regression algorithms. We first identify two key limitations in existing LLM-based algorithm evolution techniques: code bloat and a lack of semantic guidance. Bloat results in unnecessarily complex components, and the absence of semantic awareness can lead to ineffective exchange of useful code components, both of which can reduce the interpretability of the designed algorithm or hinder evolutionary learning progress. To address these issues, we enhance the LLM-based evolution framework for meta symbolic regression with two key innovations: bloat control and a complementary, semantics-aware selection operator. Additionally, we embed domain knowledge into the prompt, enabling the LLM to generate more effective and contextually relevant selection operators. Our experimental results on symbolic regression benchmarks show that LLMs can devise selection operators that outperform nine expert-designed baselines, achieving state-of-the-art performance. This demonstrates that LLMs can exceed expert-level algorithm design for symbolic regression.
摘要
大语言模型(LLMs)已经彻底改变了算法开发的范式,但其在符号回归(即算法从数据中自动发现符号表达式)中的应用仍受限制,且通常由人类专家手动设计。本文提出一种"学习进化"框架,使LLMs能够自动为进化式符号回归算法设计选择算子。我们首先指出现有基于LLM的算法进化技术存在两个关键局限:代码膨胀和语义引导缺失。代码膨胀会导致生成不必要的复杂组件,而语义意识的缺乏可能阻碍有效代码组件的交换,这两者都会降低所设计算法的可解释性或阻碍进化学习进程。为解决这些问题,我们通过两项关键创新增强了基于LLM的元符号回归进化框架:膨胀控制和互补的语义感知选择算子。此外,我们将领域知识嵌入提示词中,使LLM能生成更有效且符合上下文的选择算子。在符号回归基准测试中的实验结果表明,LLMs设计的选择算子性能优于九种专家设计的基线方法,达到了最先进的水平。这证明LLMs在符号回归领域的算法设计能力可以超越专家水平。
DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation
Abstract
arXiv:2505.18630v1 Announce Type: cross Abstract: Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose DDO, a novel LLM-based framework that performs Dual-Decision Optimization by decoupling and independently optimizing the two sub-tasks through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task.
摘要
大语言模型(LLMs)展现出强大的泛化与推理能力,使其特别适合医疗咨询(MC)等复杂决策任务。然而,现有基于LLM的方法往往未能捕捉MC的双重特性——该任务包含两个截然不同的子任务:作为序列决策过程的症状问询,以及作为分类问题的疾病诊断。这种失配常导致症状问询低效与疾病诊断不可靠。为此,我们提出DDO框架,通过协作式多智能体工作流对两个子任务进行解耦与独立优化,实现双决策优化。在三个真实世界MC数据集上的实验表明,DDO始终优于现有基于LLM的方法,并与最先进的生成式方法达到相当性能,验证了其在MC任务中的有效性。
Flex-Judge: Think Once, Judge Anywhere
Abstract
arXiv:2505.18601v1 Announce Type: cross Abstract: Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.
摘要
人类生成的奖励信号对于将生成模型与人类偏好对齐、指导训练及推理时评估至关重要。虽然采用大语言模型(LLMs)作为代理评估器(即LLM-as-a-Judge)能显著降低人工标注成本,但这些模型通常需要大量特定模态的训练数据,且难以在多样化多模态任务中实现良好泛化。本文提出Flex-Judge——一种基于推理引导的多模态评判模型,该模型利用极少量文本推理数据即可在多种模态和评估格式间实现稳健泛化。我们的核心观点是:结构化文本推理解释本身编码了可泛化的决策模式,从而能有效迁移至图像或视频等多模态评判任务。实验结果表明,Flex-Judge尽管仅使用显著更少的文本数据进行训练,其性能仍可与最先进的商业API及经过大量训练的多模态评估器相媲美甚至更优。值得注意的是,Flex-Judge在分子等缺乏全面评估基准的模态中展现出广泛影响力,凸显了其在资源受限领域的实用价值。该框架表明,基于推理的文本监督可作为传统高成本标注方法的强效替代方案,为可扩展的多模态模型即评判者(model-as-a-judge)提供了重要推进。
SEW: Self-Evolving Agentic Workflows for Automated Code Generation
Abstract
arXiv:2505.18646v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated effectiveness in code generation tasks. To enable LLMs to address more complex coding challenges, existing research has focused on crafting multi-agent systems with agentic workflows, where complex coding tasks are decomposed into sub-tasks that are assigned to specialized agents. Despite their effectiveness, current approaches heavily rely on hand-crafted agentic workflows, with both agent topologies and prompts manually designed, which limits their ability to automatically adapt to different types of coding problems. To address these limitations and enable automated workflow design, we propose Self-Evolving Workflow (SEW), a novel self-evolving framework that automatically generates and optimises multi-agent workflows. Extensive experiments on three coding benchmark datasets, including the challenging LiveCodeBench, demonstrate that our SEW can automatically design agentic workflows and optimise them through self-evolution, bringing up to 33% improvement on LiveCodeBench compared to using the backbone LLM only. Furthermore, by investigating different representation schemes of workflow, we provide insights into the optimal way to encode workflow information with text.
摘要
大语言模型(LLMs)在代码生成任务中已展现出显著成效。为使LLMs能够应对更复杂的编程挑战,现有研究致力于构建具有代理工作流的多智能体系统,将复杂编码任务分解为子任务并分配给专业化代理。尽管这些方法有效,当前方案仍严重依赖手工设计的代理工作流,其智能体拓扑结构和提示词均为人工设定,这限制了其自动适应不同类型编码问题的能力。为解决这些局限并实现工作流自动设计,我们提出自进化工作流(SEW),这是一种能自动生成并优化多智能体工作流的新型自进化框架。在三个代码基准数据集(包括高难度的LiveCodeBench)上的大量实验表明,我们的SEW能通过自主进化设计并优化代理工作流,相比仅使用骨干LLM,在LiveCodeBench上最高可带来33%的性能提升。此外,通过探究工作流的不同表示方案,我们为文本编码工作流信息的最优方式提供了理论依据。
Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics
Abstract
arXiv:2505.18658v1 Announce Type: cross Abstract: Large Language Models (LLMs) have emerged as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (AI). However, ensuring the robustness of LLMs remains a critical challenge. To address these challenges and advance the field, this survey provides a comprehensive overview of current studies in this area. First, we systematically examine the nature of robustness in LLMs, including its conceptual foundations, the importance of consistent performance across diverse inputs, and the implications of failure modes in real-world applications. Next, we analyze the sources of non-robustness, categorizing intrinsic model limitations, data-driven vulnerabilities, and external adversarial factors that compromise reliability. Following this, we review state-of-the-art mitigation strategies, and then we discuss widely adopted benchmarks, emerging metrics, and persistent gaps in assessing real-world reliability. Finally, we synthesize findings from existing surveys and interdisciplinary studies to highlight trends, unresolved issues, and pathways for future research.
摘要
大型语言模型(LLMs)已成为推动自然语言处理(NLP)和人工智能(AI)发展的关键基石。然而,确保其鲁棒性仍是重要挑战。为应对这些问题并推动领域进展,本综述对该领域现有研究进行了全面梳理。首先,我们系统性地探讨了LLMs鲁棒性的本质,包括其概念基础、多样化输入下保持性能一致的重要性,以及实际应用中失效模式的影响。其次,我们分析了非鲁棒性的来源,将其归类为内在模型局限、数据驱动的脆弱性,以及影响可靠性的外部对抗因素。随后,我们综述了前沿的缓解策略,进而讨论了广泛采用的基准测试、新兴评估指标及现实场景可靠性评估中存在的持续缺陷。最后,通过整合现有综述与跨学科研究成果,我们揭示了当前趋势、待解难题以及未来研究的潜在路径。
Large Language Models in the Task of Automatic Validation of Text Classifier Predictions
Abstract
arXiv:2505.18688v1 Announce Type: cross Abstract: Machine learning models for text classification are trained to predict a class for a given text. To do this, training and validation samples must be prepared: a set of texts is collected, and each text is assigned a class. These classes are usually assigned by human annotators with different expertise levels, depending on the specific classification task. Collecting such samples from scratch is labor-intensive because it requires finding specialists and compensating them for their work; moreover, the number of available specialists is limited, and their productivity is constrained by human factors. While it may not be too resource-intensive to collect samples once, the ongoing need to retrain models (especially in incremental learning pipelines) to address data drift (also called model drift) makes the data collection process crucial and costly over the model's entire lifecycle. This paper proposes several approaches to replace human annotators with Large Language Models (LLMs) to test classifier predictions for correctness, helping ensure model quality and support high-quality incremental learning.
摘要
文本分类的机器学习模型通过训练预测给定文本的类别。为此需要准备训练和验证样本:收集文本集合并为每篇文本标注类别。这些类别通常由不同专业水平的人工标注者根据具体分类任务进行标注。从头开始收集此类样本需要耗费大量人力,因为需寻找专家并支付报酬;此外可用专家数量有限,且其生产力受人因因素制约。虽然单次样本收集可能资源消耗不大,但为解决数据漂移(亦称模型漂移)而持续进行的模型重训练(特别是在增量学习流程中),使得数据收集过程在模型整个生命周期中至关重要且成本高昂。本文提出用大语言模型替代人工标注者的若干方法,以测试分类器预测的正确性,从而保障模型质量并支持高质量的增量学习。
ThanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation
Abstract
arXiv:2505.18640v1 Announce Type: cross Abstract: Low-Rank Adaptation (LoRA) is widely adopted for downstream fine-tuning of foundation models due to its efficiency and zero additional inference cost. Many real-world applications require foundation models to specialize in multiple tasks simultaneously, motivating the need for efficient multi-task adaptation. While recent approaches integrate LoRA with mixture-of-experts (MoE) to address this, the use of routers prevents parameter mergeability, which increases inference overhead and hinders unified multi-task adaptation, thereby limiting deployment practicality. In this work, we propose ThanoRA, a Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation framework that enables multi-task adaptation while preserving the inference efficiency of LoRA. ThanoRA jointly models task heterogeneity and mitigates subspace interference throughout training. Specifically, motivated by inherent differences in complexity and heterogeneity across tasks, ThanoRA constructs task-specific LoRA subspaces at initialization, enabling fine-grained knowledge injection aligned with task heterogeneity. Furthermore, to prevent task interference and subspace collapse during multi-task training, ThanoRA introduces a subspace-preserving regularization that maintains the independence of task-specific representations. With the synergy of both components, ThanoRA enables efficient and unified multi-task adaptation. Extensive experiments across multimodal and text-only benchmarks under varying multi-task mixtures demonstrate that ThanoRA consistently achieves robust and superior performance over strong baselines without introducing additional inference overhead. Our code is publicly available at: https://github.com/LiangJian24/ThanoRA.
摘要
低秩自适应(LoRA)因其高效性和零额外推理成本的优势,被广泛用于基础模型的下游微调。许多实际应用要求基础模型能同时适应多任务处理,这推动了对高效多任务自适应方法的需求。尽管近期研究尝试将LoRA与专家混合(MoE)相结合来解决这一问题,但路由器的使用导致参数无法合并,从而增加推理开销并阻碍统一的多任务自适应,限制了实际部署的可行性。本研究提出ThanoRA框架——一种任务异构感知的多任务低秩自适应方法,在保持LoRA推理效率的同时实现多任务自适应。ThanoRA通过联合建模任务异构性并在整个训练过程中减轻子空间干扰来实现这一目标。具体而言,基于任务间固有复杂度与异构性的差异,ThanoRA在初始化阶段构建任务特定的LoRA子空间,实现与任务异构性对齐的细粒度知识注入。此外,为防止多任务训练中的任务干扰和子空间坍缩,ThanoRA引入子空间保持正则化机制以维持任务特定表征的独立性。通过双组件的协同作用,ThanoRA实现了高效统一的多任务自适应。在多模态及纯文本基准测试上的大量实验表明,在不同多任务混合场景下,ThanoRA始终以稳健且优越的性能超越强基线方法,且未引入额外推理开销。代码已开源:https://github.com/LiangJian24/ThanoRA。
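The mergeability point in the ThanoRA abstract can be made concrete with a small sketch. It uses the standard LoRA formulation, in which a frozen weight matrix W is adapted by a low-rank product B·A, and checks that folding the product into W gives the same output as running the adapter as a separate branch. The matrices, the `scale` factor, and the helper names are illustrative assumptions; this is not ThanoRA's task-specific subspace construction or its regularizer.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
            for row in a]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def lora_forward(w, b, a, x, scale):
    """Unmerged path: frozen W x plus the scaled low-rank branch B (A x)."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


def merge_lora(w, b, a, scale):
    """Fold the adapter into the weights: W' = W + scale * (B A)."""
    ba = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, ba)]


# Toy values, chosen only for illustration.
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen pretrained weight
B = [[1.0], [2.0]]            # rank-1 adapter factor (2 x 1)
A = [[0.5, -1.0]]             # rank-1 adapter factor (1 x 2)
x = [1.0, 1.0]

unmerged = lora_forward(W, B, A, x, scale=2.0)
merged = matvec(merge_lora(W, B, A, scale=2.0), x)
print(unmerged, merged)
```

After merging, inference uses a single matrix, which is why plain LoRA adds no inference cost. With an input-dependent router the effective update changes per input, so no single merged matrix reproduces it; that is the overhead the abstract attributes to MoE-style LoRA.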
Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees
Abstract
arXiv:2505.18659v1 Announce Type: cross Abstract: Selecting artificial intelligence (AI) models, such as large language models (LLMs), from multiple candidates requires accurate performance estimation. This is ideally achieved through empirical evaluations involving abundant real-world data. However, such evaluations are costly and impractical at scale. To address this challenge, autoevaluation methods leverage synthetic data produced by automated evaluators, such as LLMs-as-judges, reducing variance but potentially introducing bias. Recent approaches have employed semi-supervised prediction-powered inference (PPI) to correct for the bias of autoevaluators. However, the use of autoevaluators may lead in practice to a degradation in sample efficiency compared to conventional methods using only real-world data. In this paper, we propose R-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model evaluation, while also ensuring an enhanced (or at least no worse) sample efficiency compared to conventional methods. The key innovation of R-AutoEval+ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data, reverting to conventional methods when the autoevaluator is insufficiently accurate. Experiments on the use of LLMs-as-judges for the optimization of quantization settings for the weights of an LLM, and for prompt design in LLMs confirm the reliability and efficiency of R-AutoEval+.
摘要
从多个候选模型(如大语言模型LLMs)中选择人工智能(AI)模型需要准确的性能评估。理想情况下,这应通过涉及大量真实世界数据的实证评估来实现。然而,此类评估成本高昂且难以大规模实施。为解决这一挑战,自动评估方法利用自动化评估器(如LLMs-as-judges)生成的合成数据来降低方差,但可能引入偏差。近期研究采用半监督预测驱动推断(PPI)来校正自动评估器的偏差。然而在实际应用中,与仅使用真实数据的传统方法相比,自动评估器可能导致样本效率下降。本文提出R-AutoEval+框架,该框架在保证模型评估具有有限样本可靠性的同时,相较于传统方法能提升(或至少不降低)样本效率。R-AutoEval+的核心创新在于自适应构建模型评估变量,动态调整对合成数据的依赖程度,当自动评估器精度不足时自动回归传统方法。在LLM权重量化设置优化和LLM提示设计场景中使用LLMs-as-judges的实验证实了R-AutoEval+的可靠性与高效性。
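As a rough illustration of the prediction-powered inference idea mentioned in this abstract, the sketch below estimates a mean score from many autoevaluator (judge) scores plus a small human-labeled set that corrects the judge's bias. It shows only the basic PPI mean estimator; the adaptive construction and finite-sample guarantees of R-AutoEval+ are not reproduced, and the function name and toy numbers are assumptions.

```python
from statistics import fmean


def ppi_mean(human_labels, judge_on_labeled, judge_on_unlabeled):
    """Basic prediction-powered estimate of a mean score.

    The judge's average on the large unlabeled pool is shifted by a
    'rectifier': the average gap between human labels and judge scores
    on the small labeled set.
    """
    rectifier = fmean([y - f for y, f in zip(human_labels, judge_on_labeled)])
    return fmean(judge_on_unlabeled) + rectifier


# Toy data (assumed): the judge is too generous on one labeled example.
human_labels = [1, 0, 1, 1]
judge_on_labeled = [1, 1, 1, 1]
judge_on_unlabeled = [1, 1, 0, 1, 1, 1, 0, 1]

estimate = ppi_mean(human_labels, judge_on_labeled, judge_on_unlabeled)
print(estimate)
```

If the judge agreed with every human label, the rectifier would be zero and the estimate would be the judge's average on the unlabeled pool.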
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
Abstract
arXiv:2505.18675v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have recently achieved significant progress in visual tasks, including semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks involving fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.
摘要
多模态大语言模型(MLLMs)近期在视觉任务中取得显著进展,涵盖语义场景理解和图文对齐等领域,其推理变体更在涉及数学与逻辑的复杂任务上表现出性能提升。然而,这些模型在需要细粒度视觉理解的推理任务中的能力尚未得到充分评估。为此,我们提出ReasonMap基准测试,旨在系统评估MLLMs的细粒度视觉理解与空间推理能力。该基准包含来自13个国家30个城市的高清交通路线图,共计1,008个涵盖两种问题类型和三种模板的问答对。我们进一步设计了两级评估流程,以准确评判答案的正确性与质量。通过对15个主流MLLMs(包括基础版与推理变体)的全面测试,发现一个反直觉现象:开源模型中基础版性能优于推理版,而闭源模型则呈现相反趋势。此外,当视觉输入被遮蔽时模型性能普遍下降,这表明尽管MLLMs能利用先验知识回答部分问题,但优秀的细粒度视觉推理仍需依赖真实的视觉感知。本研究为视觉推理领域提供了新见解,并为探索开源与闭源模型间的性能差距贡献了研究基础。
Steering LLM Reasoning Through Bias-Only Adaptation
Abstract
arXiv:2505.18706v1 Announce Type: cross Abstract: Recent work on reasoning-oriented language models, exemplified by o1-like systems, suggests that reinforcement-learning (RL) finetuning does not create new capabilities but instead strengthens reasoning patterns already latent in the pretrained network. We test this claim by training steering vectors: layer-wise biases that additively amplify selected hidden features while leaving all original weights unchanged. Experiments on four base models across the GSM8K and MATH benchmarks show that steering vectors recover, and in several cases exceed, the accuracy of fully-tuned counterparts. This result supports the view that the required reasoning skills pre-exist in the base model. Further, logit-lens analysis reveals that the trained vectors consistently boost token groups linked to structured languages and logical connectors, providing an interpretable account that aligns with the demands of quantitative reasoning tasks.
摘要
近期关于推理导向语言模型的研究(以o1类系统为例)表明,强化学习(RL)微调并不会创造新能力,而是强化了预训练网络中已有的潜在推理模式。我们通过训练导向向量(即逐层偏置项,以加法方式放大选定隐藏特征同时保持原始权重不变)来验证这一主张。在GSM8K和MATH基准测试中对四个基础模型进行的实验显示,导向向量恢复并在多个案例中超越了完全微调模型的准确率。这一结果支持了"所需推理技能已存在于基础模型中"的观点。此外,logit透镜分析表明,训练后的向量持续增强了与结构化语言和逻辑连接词相关的标记组,为定量推理任务的需求提供了可解释的依据。
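A toy sketch of bias-only adaptation, under strong simplifying assumptions: a single frozen linear layer W and a frozen readout v stand in for the pretrained network, and the only trainable parameter is a bias added to the hidden state. The model, data, learning rate, and function names are hypothetical and are not taken from the paper.

```python
def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def forward(x, w, v, bias):
    """Frozen layer W, additive steering bias on the hidden state, frozen readout v."""
    hidden = matvec(w, x)
    steered = [h_i + b_i for h_i, b_i in zip(hidden, bias)]
    return sum(v_i * s_i for v_i, s_i in zip(v, steered))


def mse(data, w, v, bias):
    """Mean squared error of the steered model on (input, target) pairs."""
    return sum((forward(x, w, v, bias) - t) ** 2 for x, t in data) / len(data)


def train_steering_vector(data, w, v, steps=200, lr=0.05):
    """Gradient descent on the bias only; W and v are never modified."""
    bias = [0.0] * len(v)
    for _ in range(steps):
        grad = [0.0] * len(v)
        for x, target in data:
            err = forward(x, w, v, bias) - target
            for i, v_i in enumerate(v):
                # d/d bias_i of (output - target)^2 is 2 * err * v_i
                grad[i] += 2.0 * err * v_i / len(data)
        bias = [b - lr * g for b, g in zip(bias, grad)]
    return bias


# Toy setup (assumed): the desired behaviour is the frozen output shifted by +1.
W = [[1.0, -1.0], [0.5, 2.0]]
v = [1.0, -0.5]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
zero_bias = [0.0, 0.0]
data = [(x, forward(x, W, v, zero_bias) + 1.0) for x in inputs]

W_before = [row[:] for row in W]
v_before = v[:]
loss_before = mse(data, W, v, zero_bias)
steering = train_steering_vector(data, W, v)
loss_after = mse(data, W, v, steering)
print(loss_before, loss_after)
```

The target here is reachable by a bias alone, which is the favourable case; the abstract's empirical claim is that much of the gain from full fine-tuning on GSM8K and MATH behaves similarly.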
Can LLMs Alleviate Catastrophic Forgetting in Graph Continual Learning? A Systematic Study
Abstract
arXiv:2505.18697v1 Announce Type: cross Abstract: Nowadays, real-world data, including graph-structure data, often arrives in a streaming manner, which means that learning systems need to continuously acquire new knowledge without forgetting previously learned information. Although substantial existing works attempt to address catastrophic forgetting in graph machine learning, they are all based on training from scratch with streaming data. With the rise of pretrained models, an increasing number of studies have leveraged their strong generalization ability for continual learning. Therefore, in this work, we attempt to answer whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL). We first point out that current experimental setups for GCL have significant flaws, as the evaluation stage may lead to task ID leakage. Then, we evaluate the performance of LLMs in more realistic scenarios and find that even minor modifications can lead to outstanding results. Finally, based on extensive experiments, we propose a simple-yet-effective method, Simple Graph Continual Learning (SimGCL), that surpasses the previous state-of-the-art GNN-based baseline by around 20% under the rehearsal-free constraint. To facilitate reproducibility, we have developed an easy-to-use benchmark LLM4GCL for training and evaluating existing GCL methods. The code is available at: https://github.com/ZhixunLEE/LLM4GCL.
摘要
当今世界,包括图结构数据在内的现实数据往往以流式方式到达,这意味着学习系统需要在不遗忘已掌握知识的前提下持续获取新信息。尽管现有大量研究致力于解决图机器学习中的灾难性遗忘问题,但这些方法均基于流式数据从头训练的范式。随着预训练模型的兴起,越来越多的研究利用其强大的泛化能力进行持续学习。为此,本研究旨在探究大型语言模型(LLMs)能否缓解图持续学习(GCL)中的灾难性遗忘问题。我们首先指出当前GCL实验设置存在重大缺陷,其评估阶段可能导致任务ID泄露。随后在更贴近现实的场景下评估LLMs性能,发现即使进行细微调整也能获得卓越效果。最终通过大量实验提出一种简单而有效的方法——简单图持续学习(SimGCL),在无排练约束条件下以约20%的优势超越此前最先进的基于图神经网络的基线方法。为促进可复现性研究,我们开发了易于使用的基准框架LLM4GCL,用于训练和评估现有GCL方法。代码已开源:https://github.com/ZhixunLEE/LLM4GCL。
GainRAG: Preference Alignment in Retrieval-Augmented Generation through Gain Signal Synthesis
Abstract
arXiv:2505.18710v1 Announce Type: cross Abstract: The Retrieval-Augmented Generation (RAG) framework introduces a retrieval module to dynamically inject retrieved information into the input context of large language models (LLMs), and has demonstrated significant success in various NLP tasks. However, the current study points out that there is a preference gap between retrievers and LLMs in the RAG framework, which limits the further improvement of system performance. Some highly relevant passages may interfere with LLM reasoning because they contain complex or contradictory information, while some indirectly related or even inaccurate content may help the LLM generate more accurate answers by providing suggestive information or logical clues. To solve this, we propose GainRAG, a novel approach that aligns the retriever's and LLM's preferences by defining a new metric, "gain", which measures how well an input passage contributes to correct outputs. Specifically, we propose a method to estimate these gain signals and train a middleware that aligns the preferences of the retriever and the LLM using only limited data. In addition, we introduce a pseudo-passage strategy to mitigate degradation. The experimental results on 6 datasets verify the effectiveness of GainRAG.
摘要
检索增强生成(RAG)框架通过引入检索模块,将检索到的信息动态注入大型语言模型(LLM)的输入上下文,已在多种自然语言处理任务中展现出显著成效。然而,当前研究指出RAG框架中检索器与LLM之间存在偏好差异,这限制了系统性能的进一步提升。某些高相关性段落可能因包含复杂或矛盾信息而干扰LLM推理;而部分间接相关甚至不准确的内容,却可能通过提供提示性信息或逻辑线索帮助LLM生成更准确的答案。为此,我们提出GainRAG方法,通过定义新指标"增益"来衡量输入段落对正确输出的贡献程度,从而实现检索器与LLM的偏好对齐。具体而言,我们提出一种增益信号估计方法,并训练仅需有限数据即可实现两者偏好对齐的中间件。此外,我们引入伪段落策略以缓解性能退化问题。在6个数据集上的实验结果验证了GainRAG的有效性。
How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark
Abstract
arXiv:2505.18761v1 Announce Type: cross Abstract: We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark to evaluate Large Language Models' (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injections, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are significantly sensitive to IC, affecting both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.
摘要
我们提出了'含干扰情境的小学数学题'(GSM-DC)这一合成基准,用于评估大语言模型(LLM)在系统控制无关情境(IC)下的推理鲁棒性。GSM-DC通过构建符号化推理图并注入精确设计的干扰项,实现了严格且可复现的评估。实验表明,LLM对无关情境表现出显著敏感性,这种干扰既影响推理路径选择也降低算术准确性。此外,采用强干扰项进行模型训练可提升其在分布内和分布外场景的表现。我们进一步提出了一种基于过程奖励模型的逐步树搜索方法,该方法显著增强了模型在分布外条件下的鲁棒性。
Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Abstract
arXiv:2505.18720v1 Announce Type: cross Abstract: Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO's effectiveness in improving instruction-following ability across various settings. Code is available at https://github.com/Mimasss2/OTPO.
摘要
直接偏好优化(DPO)作为一种有前景的框架,通过直接优化选定响应与拒绝响应的对数似然差,实现了大型语言模型(LLM)与人类偏好的对齐。然而现有方法均等对待响应中的所有词元,而人类更关注具有实际意义的部分。这导致偏好优化效果欠佳,因为无关或噪声词元会对DPO损失产生不成比例的影响。为解决这一局限,我们提出基于最优传输的词元加权方案来增强直接偏好优化(OTPO)。通过强化语义重要词元对的权重并弱化相关性较低的部分,本方法引入了一种上下文感知的词元加权机制,从而产生更具对比性的奖励差异估计。这种自适应加权方式增强了奖励稳定性,提高了可解释性,并确保偏好优化聚焦于响应间有意义的差异。大量实验验证了OTPO在不同场景下提升指令跟随能力的有效性(代码详见https://github.com/Mimasss2/OTPO)。
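To make the token-weighting idea concrete, here is a minimal sketch of the DPO loss with per-token weights applied to the policy-versus-reference log-ratios. With all weights equal to 1 it reduces to the standard DPO objective. How OTPO derives its weights from optimal transport is not shown; the weights are plain inputs here, and the function names and numbers are assumptions.

```python
import math


def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Weighted sum over tokens of log pi(token) - log pi_ref(token)."""
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))


def weighted_dpo_loss(chosen, rejected, beta=0.1):
    """DPO loss -log(sigmoid(beta * margin)) with token-weighted log-ratios.

    `chosen` and `rejected` are (policy_logps, ref_logps, weights) triples.
    """
    margin = weighted_log_ratio(*chosen) - weighted_log_ratio(*rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy per-token log-probabilities (assumed values).
chosen_policy, chosen_ref = [-1.0, -2.0, -0.5], [-1.2, -2.0, -1.0]
rejected_policy, rejected_ref = [-1.5, -1.0], [-1.0, -1.0]

uniform = weighted_dpo_loss(
    (chosen_policy, chosen_ref, [1.0, 1.0, 1.0]),
    (rejected_policy, rejected_ref, [1.0, 1.0]),
)
# Hypothetical weights that emphasise the last chosen token.
focused = weighted_dpo_loss(
    (chosen_policy, chosen_ref, [0.0, 0.0, 2.0]),
    (rejected_policy, rejected_ref, [1.0, 1.0]),
)
print(uniform, focused)
```

In this toy case, shifting weight onto the token where the policy improved most over the reference enlarges the margin and lowers the loss, which is the kind of contrast the abstract describes.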
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
Abstract
arXiv:2505.18719v1 Announce Type: cross Abstract: Recent high-capacity vision-language-action (VLA) models have demonstrated impressive performance on a range of robotic manipulation tasks by imitating human demonstrations. However, exploiting offline data with limited visited states will cause execution failure in out-of-distribution scenarios. Intuitively, an exploration-based method that improves on online collected data at test time could address this limitation. We present VLA-RL, an algorithmic and systematic framework that leverages online reinforcement learning (RL) to improve pretrained auto-regressive VLAs in downstream tasks. Within a unified perspective, we first introduce a trajectory-level RL formulation for auto-regressive VLA training, which models general robotic manipulation trajectory as multi-modal multi-turn conversation. To address the challenge of sparse rewards, we fine-tune a pretrained vision-language model as a robotic process reward model, which is trained on pseudo reward labels annotated on automatically extracted task segments. To scale up, we identify several implementation findings that improve the stability and efficiency including curriculum selection strategy, GPU-balanced vectorized environments, batch decoding, and critic warmup. VLA-RL enables OpenVLA-7B to surpass the strongest finetuned baseline by 4.5% on 40 challenging robotic manipulation tasks in LIBERO, and even matches the performance of advanced commercial models such as π0-FAST. Notably, we observe that VLA-RL benefits from increased test-time optimization, indicating an early spark of inference scaling laws in robotics.
摘要
近期的高容量视觉-语言-动作(VLA)模型通过模仿人类示范,在一系列机器人操作任务中展现出卓越性能。然而,利用状态覆盖有限的离线数据会导致分布外场景下的执行失败。直观上,一种基于探索的方法能够在测试时优化在线收集的数据,从而解决这一局限。我们提出VLA-RL算法与系统框架,该框架利用在线强化学习(RL)提升预训练自回归VLA模型在下游任务中的表现。在统一视角下,我们首先提出面向自回归VLA训练的轨迹级RL建模方法,将通用机器人操作轨迹视为多模态多轮对话。针对稀疏奖励的挑战,我们微调预训练视觉-语言模型作为机器人流程奖励模型,其训练数据基于自动提取的任务片段生成的伪奖励标注。为实现规模化,我们提出了提升稳定性与效率的关键实现技术,包括课程选择策略、GPU负载均衡的向量化环境、批量解码以及评论家网络预热。VLA-RL使OpenVLA-7B模型在LIBERO基准的40项复杂机器人操作任务中超越最强微调基线4.5%,甚至媲美π0-FAST等先进商业模型性能。值得注意的是,我们发现VLA-RL能从测试时优化中持续获益,这预示着机器人领域推理缩放定律的早期萌芽。
LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning
Abstract
arXiv:2505.18724v1 Announce Type: cross Abstract: Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, fine-tuning quantized models presents significant challenges, primarily stemming from: First, the mismatch in data types between the low-precision quantized weights (e.g., 4-bit) and the high-precision adaptation weights (e.g., 16-bit). This mismatch limits the computational efficiency advantage offered by quantized weights during inference. Second, potential accuracy degradation when merging these high-precision adaptation weights into the low-precision quantized weights, as the adaptation weights often necessitate approximation or truncation. Third, as far as we know, no existing methods support the lossless merging of adaptation while adjusting all quantized weights. To address these challenges, we introduce lossless ternary adaptation for quantization-aware fine-tuning (LoTA-QAF). This is a novel fine-tuning method specifically designed for quantized LLMs, enabling the lossless merging of ternary adaptation weights into quantized weights and the adjustment of all quantized weights. LoTA-QAF operates through a combination of: i) A custom-designed ternary adaptation (TA) that aligns ternary weights with the quantization grid and uses these ternary weights to adjust quantized weights. ii) A TA-based mechanism that enables the lossless merging of adaptation weights. iii) Ternary signed gradient descent (t-SignSGD) for updating the TA weights. We apply LoTA-QAF to Llama-3.1/3.3 and Qwen-2.5 model families and validate its effectiveness on several downstream tasks. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%. For task-specific fine-tuning, 16-bit LoRA achieves superior results, but LoTA-QAF still outperforms other methods.
摘要
量化与微调对于在资源受限的边缘设备上部署大语言模型(LLMs)至关重要。然而,量化模型的微调面临重大挑战,主要源于:首先,低精度量化权重(如4位)与高精度适配权重(如16位)之间的数据类型不匹配,这限制了量化权重在推理时提供的计算效率优势;其次,将这些高精度适配权重合并到低精度量化权重时可能导致精度下降,因为适配权重往往需要近似或截断处理;第三,据我们所知,现有方法均不支持在调整所有量化权重的同时实现适配权重的无损合并。为解决这些挑战,我们提出面向量化感知微调的无损三元适配方法(LoTA-QAF)。这是一种专为量化LLMs设计的新型微调方法,能够将三元适配权重无损合并到量化权重中并调整所有量化权重。LoTA-QAF通过以下组合实现:i) 定制设计的三元适配(TA),使三元权重与量化网格对齐,并利用这些三元权重调整量化权重;ii) 基于TA的机制实现适配权重的无损合并;iii) 用于更新TA权重的三元符号梯度下降(t-SignSGD)。我们将LoTA-QAF应用于Llama-3.1/3.3和Qwen-2.5模型系列,并在多个下游任务上验证其有效性。在MMLU基准测试中,我们的方法有效恢复了量化模型的性能,较16位LoRA最高提升5.14%。在任务特定微调方面,16位LoRA虽取得更优结果,但LoTA-QAF仍优于其他方法。
Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models
Abstract
arXiv:2505.18773v1 Announce Type: cross Abstract: State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training reference models (e.g., fine-tuning attacks), or on stronger attacks applied to small-scale models and datasets. However, weaker attacks have been shown to be brittle - achieving close-to-arbitrary success - and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges have prompted an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA - one of the strongest MIAs - to GPT-2 architectures ranging from 10M to 1B parameters, training reference models on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in three key ways: (1) strong MIAs can succeed on pre-trained LLMs; (2) their effectiveness, however, remains limited (e.g., AUC<0.7) in practical settings; and, (3) the relationship between MIA success and related privacy metrics is not as straightforward as prior work has suggested.
摘要
现有最先进的成员推断攻击(MIA)通常需要训练大量参考模型,这使得此类攻击难以扩展到大型预训练语言模型(LLM)。因此,先前研究要么依赖无需训练参考模型的较弱攻击(如微调攻击),要么将强力攻击应用于小规模模型和数据集。然而,较弱攻击已被证明具有脆弱性——其成功率接近随机水平——且在简化场景中强力攻击的洞察无法迁移至当今的LLM。这些挑战引出了一个关键问题:先前工作中观察到的局限性是源于攻击设计选择,还是MIA本质上对LLM无效?我们通过将最强MIA之一的LiRA扩展至参数规模从1000万到10亿的GPT-2架构(在C4数据集上训练超过200亿token的参考模型)来解答该问题。我们的研究从三个关键方面推进了对LLM上MIA的理解:(1)强力MIA可成功作用于预训练LLM;(2)但在实际场景中其有效性仍有限(如AUC<0.7);(3)MIA成功率与相关隐私指标的关系并非如先前研究暗示的那般直接。
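For readers unfamiliar with LiRA, the sketch below shows the core likelihood-ratio test in its online form: fit one Gaussian to a per-example statistic from reference models trained with the example and another from models trained without it, then compare the target model's statistic under both. The logit-scaled confidence statistic follows the commonly described formulation for classifiers; the statistic used for language models in this paper may differ, and all numbers are made up for illustration.

```python
import math
from statistics import fmean, stdev


def logit(p):
    """Logit-scale a confidence in (0, 1) so it is closer to Gaussian."""
    return math.log(p) - math.log1p(-p)


def gaussian_logpdf(x, mean, std):
    """Log-density of a normal distribution, computed directly to avoid underflow."""
    return -math.log(std) - 0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - mean) / std) ** 2


def lira_score(target_conf, in_confs, out_confs):
    """Log-likelihood ratio: positive values favour 'example was in training'.

    in_confs / out_confs are the confidences that reference models trained
    with / without the example assign to it.
    """
    x = logit(target_conf)
    ins = [logit(p) for p in in_confs]
    outs = [logit(p) for p in out_confs]
    return (gaussian_logpdf(x, fmean(ins), stdev(ins))
            - gaussian_logpdf(x, fmean(outs), stdev(outs)))


# Hypothetical reference-model confidences on one example.
in_confs = [0.90, 0.95, 0.97, 0.92]
out_confs = [0.40, 0.50, 0.60, 0.45]

member_like = lira_score(0.95, in_confs, out_confs)
non_member_like = lira_score(0.50, in_confs, out_confs)
print(member_like, non_member_like)
```

Each example needs its own pair of fitted distributions, which is why the attack requires many reference models and is costly to scale, as the abstract notes.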
ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
Abstract
arXiv:2505.18799v1 Announce Type: cross Abstract: Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant costs, including constructing task-specific instruction pairs and extensive training adjustments. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the Attention Localization and Pruning Strategy (ALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only 10% of attention parameters during fine-tuning while achieving a 2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment.
摘要
将通用大语言模型(LLMs)与下游任务对齐通常需要高昂成本,包括构建任务特定的指令对和大量训练调整。先前研究通过最小数据训练或数据驱动激活来识别关键注意力头,探索了多种提升对齐效率的途径。然而这些方法本质上存在数据依赖性,限制了泛化性和可复用性。为解决该问题并提升模型对齐效率,我们提出注意力定位与剪枝策略(ALPS),该高效算法能定位任务最敏感的注意力头,并通过限制注意力训练更新至这些头部来实施剪枝,从而降低对齐成本。实验结果表明,我们的方法在微调期间仅激活10%的注意力参数,同时在三个任务上实现比基线模型2%的性能提升。此外,所识别的任务特定头部具有跨数据集可迁移性,并能缓解知识遗忘。本工作为高效LLM对齐提供了新视角。
HD-PiSSA: High-Rank Distributed Orthogonal Adaptation
Abstract
arXiv:2505.18777v1 Announce Type: cross Abstract: Existing parameter-efficient fine-tuning (PEFT) methods for large language models (LLMs), such as LoRA and PiSSA, constrain model updates to low-rank subspaces, limiting their expressiveness and leading to suboptimal performance on complex tasks. To address this, we introduce High-rank Distributed PiSSA (HD-PiSSA), a distributed PEFT approach that initializes orthogonal adapters across different devices and aggregates their delta updates collectively on W for fine-tuning. Unlike Data Parallel LoRA or PiSSA, which maintain identical adapters across all devices, HD-PiSSA assigns different principal components of the pre-trained weights to each GPU, significantly expanding the range of update directions. This results in over 16x higher effective updated ranks than data-parallel LoRA or PiSSA when fine-tuning on 8 GPUs with the same per-device adapter rank. Empirically, we evaluate HD-PiSSA across various challenging downstream tasks, including mathematics, code generation, and multi-task learning. In the multi-task setting, HD-PiSSA achieves average gains of 10.0 absolute points (14.63%) over LoRA and 4.98 points (6.60%) over PiSSA across 12 benchmarks, demonstrating its benefits from the extra optimization flexibility.
摘要
现有针对大语言模型(LLM)的参数高效微调方法(如LoRA和PiSSA)将模型更新限制在低秩子空间,制约了其表达能力,导致复杂任务性能欠佳。为此,我们提出高秩分布式PiSSA(HD-PiSSA),该方法通过在不同设备上初始化正交适配器,并聚合其对权重矩阵W的增量更新进行分布式微调。与数据并行的LoRA或PiSSA保持所有设备适配器一致不同,HD-PiSSA为每个GPU分配预训练权重矩阵的不同主成分,从而显著扩展更新方向的范围。当在8个GPU上以相同设备级适配器秩进行微调时,其有效更新秩达到数据并行LoRA或PiSSA的16倍以上。实验评估表明,在数学推理、代码生成和多任务学习等具有挑战性的下游任务中,HD-PiSSA表现优异。多任务场景下,该方法在12个基准测试中平均较LoRA提升10.0个绝对百分点(14.63%),较PiSSA提升4.98个点(6.60%),充分证明了其额外优化灵活性带来的优势。
REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing
Abstract
arXiv:2505.18880v1 Announce Type: cross Abstract: Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips into their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.
摘要
短视频是推广内容和提升知识可及性的有效工具。现有抽取式视频摘要方法难以生成连贯的叙事,而生成式方法则无法从输入视频中"引用"内容,即在输出中插入短视频片段。本研究探索了一种新型视频编辑模型,用于生成兼具连贯叙事和长视频片段引用的短视频。我们提出了一种创新的检索嵌入生成框架,使大语言模型在保持叙事连贯性的同时能够引用多模态资源。所提出的REGen系统首先通过微调的大语言模型生成带有引用占位符的故事脚本,随后利用新型检索模型从候选视频片段池中选择最能支撑叙事的片段进行替换填充。我们在纪录片预告片生成任务上验证了该方法,该场景中常通过简短采访片段来强化叙事。客观评估表明,该方法能有效插入短视频片段并保持叙事连贯性。主观调研显示,在预告片生成的连贯性、对齐度和真实性方面,本方法优于现有生成式和抽取式方法。
Writing Like the Best: Exemplar-Based Expository Text Generation
Abstract
arXiv:2505.18859v1 Announce Type: cross Abstract: We introduce the Exemplar-Based Expository Text Generation task, aiming to generate an expository text on a new topic using an exemplar on a similar topic. Current methods fall short due to their reliance on extensive exemplar data, difficulty in adapting topic-specific content, and issues with long-text coherence. To address these challenges, we propose the concept of Adaptive Imitation and present a novel Recurrent Plan-then-Adapt (RePA) framework. RePA leverages large language models (LLMs) for effective adaptive imitation through a fine-grained plan-then-adapt process. RePA also enables recurrent segment-by-segment imitation, supported by two memory structures that enhance input clarity and output coherence. We also develop task-specific evaluation metrics--imitativeness, adaptiveness, and adaptive-imitativeness--using LLMs as evaluators. Experimental results across three diverse datasets that we collected demonstrate that RePA surpasses existing baselines in producing factual, consistent, and relevant texts for this task.
摘要
我们提出"基于范例的说明文生成"任务,旨在利用相似主题的范例生成新主题的说明文。现有方法因依赖大量范例数据、难以适配主题特定内容及长文本连贯性问题而存在局限。为解决这些挑战,我们提出"自适应模仿"概念,并设计新型"循环规划-适配"框架(RePA)。该框架通过细粒度"规划-适配"流程,利用大语言模型实现有效自适应模仿。RePA还支持基于两种记忆结构的逐段循环模仿,从而提升输入清晰度与输出连贯性。我们采用大语言模型作为评估器,开发了任务特异性指标——模仿度、适配度及自适应模仿度。在自建的三个多样化数据集上的实验结果表明,RePA在生成事实准确、内容一致且主题相关的文本方面优于现有基线方法。
PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models
Abstract
arXiv:2505.18901v1 Announce Type: cross Abstract: The rapid advancement of generative AI models has provided users with numerous options to address their prompts. When selecting a generative AI model for a given prompt, users should consider not only the performance of the chosen model but also its associated service cost. The principle guiding such consideration is to select the least expensive model among the available satisfactory options. However, existing model-selection approaches typically prioritize performance, overlooking pricing differences between models. In this paper, we introduce PromptWise, an online learning framework designed to assign a sequence of prompts to a group of large language models (LLMs) in a cost-effective manner. PromptWise strategically queries cheaper models first, progressing to more expensive options only if the lower-cost models fail to adequately address a given prompt. Through numerical experiments, we demonstrate PromptWise's effectiveness across various tasks, including puzzles of varying complexity and code generation/translation tasks. The results highlight that PromptWise consistently outperforms cost-unaware baseline methods, emphasizing that directly assigning prompts to the most expensive models can lead to higher costs and potentially lower average performance.
摘要
生成式AI模型的快速发展为用户提供了多种响应提示的选择。在为给定提示选择生成式AI模型时,用户不仅应考虑所选模型的性能,还需关注其相关服务成本。指导原则是在可用的满意选项中选择成本最低的模型。然而,现有模型选择方法通常优先考虑性能,忽视了模型间的定价差异。本文提出PromptWise——一个在线学习框架,旨在以经济高效的方式将提示序列分配给一组大语言模型(LLM)。该策略首先查询成本较低的模型,仅当低成本模型无法充分响应提示时,才会转向更昂贵的选项。通过数值实验,我们验证了PromptWise在不同任务中的有效性,包括复杂度各异的谜题解决及代码生成/翻译任务。结果表明:PromptWise始终优于不考虑成本的基线方法,这证实直接为提示分配最昂贵模型会导致更高成本,并可能降低平均性能。
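The selection principle stated in the PromptWise abstract (query cheaper models first and move to more expensive ones only when the cheaper answer is inadequate) can be sketched as a simple cascade. This shows only that principle, not the paper's online learning algorithm; the function `cheapest_satisfactory`, the toy models, their costs, and the `is_satisfactory` check are all hypothetical.

```python
def cheapest_satisfactory(prompt, models, is_satisfactory):
    """Query models in ascending cost order and stop at the first good answer.

    `models` is a list of (name, cost, generate) tuples.
    Returns (name, answer, total_cost); name and answer are None if all fail.
    """
    total_cost = 0
    for name, cost, generate in sorted(models, key=lambda m: m[1]):
        total_cost += cost              # every query is paid for, even failed ones
        answer = generate(prompt)
        if is_satisfactory(prompt, answer):
            return name, answer, total_cost
    return None, None, total_cost


# Toy stand-ins for a cheap and an expensive model (assumed for illustration).
def small_model(prompt):
    return {"2+2": "4"}.get(prompt, "unknown")


def large_model(prompt):
    return {"2+2": "4", "17*3": "51"}.get(prompt, "unknown")


def is_satisfactory(prompt, answer):
    return answer != "unknown"


models = [("large", 10, large_model), ("small", 1, small_model)]
easy = cheapest_satisfactory("2+2", models, is_satisfactory)
hard = cheapest_satisfactory("17*3", models, is_satisfactory)
print(easy, hard)
```

The easy prompt is resolved by the cheap model at cost 1, while the hard prompt costs 11 because the failed cheap query is still charged. Deciding which model to try first for a given prompt type is the part an online learner would have to estimate.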
Security Concerns for Large Language Models: A Survey
Abstract
arXiv:2505.18889v1 Announce Type: cross Abstract: Large Language Models (LLMs) such as GPT-4 (and its recent iterations like GPT-4o and the GPT-4.1 series), Google's Gemini, Anthropic's Claude 3 models, and xAI's Grok have caused a revolution in natural language processing, but their capabilities also introduce new security vulnerabilities. In this survey, we provide a comprehensive overview of the emerging security concerns around LLMs, categorizing threats into prompt injection and jailbreaking, adversarial attacks (including input perturbations and data poisoning), misuse by malicious actors (e.g., for disinformation, phishing, and malware generation), and worrisome risks inherent in autonomous LLM agents. A significant focus has been recently placed on the latter, exploring goal misalignment, emergent deception, self-preservation instincts, and the potential for LLMs to develop and pursue covert, misaligned objectives (scheming), which may even persist through safety training. We summarize recent academic and industrial studies (2022-2025) that exemplify each threat, analyze proposed defenses and their limitations, and identify open challenges in securing LLM-based applications. We conclude by emphasizing the importance of advancing robust, multi-layered security strategies to ensure LLMs are safe and beneficial.
摘要
诸如GPT-4(及其近期迭代版本如GPT-4o和GPT-4.1系列)、谷歌Gemini、Anthropic的Claude 3模型以及xAI的Grok等大语言模型(LLMs)引发了自然语言处理领域的革命,但其能力也带来了新的安全漏洞。本综述全面梳理了围绕LLMs新出现的安全问题,将威胁归类为提示注入与越狱、对抗攻击(包括输入扰动和数据投毒)、恶意行为者滥用(如用于虚假信息、钓鱼攻击和恶意软件生成)以及自主LLM智能体固有的高风险隐患。近期研究重点聚焦于最后一项,探讨目标错位、涌现性欺骗、自我保存本能,以及LLMs可能形成并追求隐蔽且错位目标(密谋)的潜在风险——这些行为甚至可能在安全训练后持续存在。我们汇总了2022-2025年间体现各类威胁的学术与产业研究实例,分析现有防御方案及其局限性,并指出保障LLM应用安全面临的开放挑战。最后强调必须发展鲁棒的多层安全策略,以确保LLMs的安全性与有益性。
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
Abstract
arXiv:2505.18878v1 Announce Type: cross Abstract: While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.
摘要
虽然AI代理在商业领域具有变革潜力,但由于广泛使用平台上公开、真实的商业数据匮乏,其性能基准测试的有效性受到制约。现有基准测试在环境模拟、数据真实性及代理-用户交互方面往往缺乏保真度,且对多样化商业场景和行业的覆盖有限。为弥补这些不足,我们推出CRMArena-Pro——一个用于全面、真实评估大语言模型代理在多元专业场景中表现的新型基准测试体系。该系统在CRMArena基础上扩展,涵盖销售、服务和"配置-定价-报价"流程的19项专家验证任务,同时支持企业级(B2B)和消费者级(B2C)场景。其独特之处在于整合了多角色引导的多轮次交互机制,以及严格的保密意识评估体系。实验数据显示,领先的大语言模型代理在CRMArena-Pro单轮测试中成功率仅约58%,而在多轮交互环境下性能显著下降至35%左右。尽管工作流执行对顶级代理更具可操作性(单轮成功率超83%),其他被测商业技能则表现出更大挑战性。此外,代理表现出近乎零的固有保密意识,虽然针对性提示能改善此问题,但往往以任务性能下降为代价。这些发现揭示了当前大语言模型能力与企业需求间的显著差距,凸显了在多轮推理、保密合规及多技能习得等方面进行技术突破的必要性。
Behavior Injection: Preparing Language Models for Reinforcement Learning
Abstract
arXiv:2505.18917v1 Announce Type: cross Abstract: Reinforcement fine-tuning (RFT) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RFT: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data-augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increase the performance gain from RFT over the pre-RL model.
摘要
强化微调(RFT)已成为一种强大的训练后技术,用于增强大语言模型(LLMs)的推理能力。然而,LLMs对RFT的反应可能非常不一致:部分模型表现出显著的性能提升,而其他模型则停滞不前甚至性能下降。为理解这种差异,我们分析了RL目标在每一步的影响,并确定了有效训练后的两个关键条件:(1)RL信息化的推演准确率,以及(2)强数据共影响力(用于量化训练数据对其他样本性能的影响程度)。基于这些发现,我们提出了行为注入——一种在RL之前应用的与任务无关的数据增强方案。该方案通过植入探索性和利用性行为来丰富监督微调(SFT)数据,从而有效提升模型的RL适应性。我们在两个推理基准测试中采用多种基础模型评估了该方法。结果表明,这种理论驱动的数据增强能显著提高RFT相对于RL前模型的性能增益。
The Price of Format: Diversity Collapse in LLMs
Abstract
arXiv:2505.18949v1 Announce Type: cross Abstract: Instruction-tuned large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. However, we identify a critical limitation of such formatting: it induces a phenomenon we term diversity collapse, where the model generates semantically similar outputs for open-ended inputs, undermining creativity and variability. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that (1) diversity collapse persists even under high-temperature sampling, and (2) structural tokens in templates significantly constrain the model's output space. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity. Our analysis shows that format consistency between fine-tuning and inference is crucial for structure-sensitive tasks (e.g., GSM8K, IFEval), but has marginal influence on knowledge-heavy tasks (e.g., MMLU, WebQuestions). In contrast, output diversity is primarily governed by the presence or absence of structural tokens, with minimal formatting yielding the most diverse outputs. These findings reveal that current prompting conventions, while beneficial for alignment, may inadvertently suppress output diversity, underscoring the need for diversity-aware prompt design and instruction tuning.
摘要
指令调优的大型语言模型(LLMs)采用结构化模板(如角色标记和特殊符号)以在推理过程中保持格式一致性。然而,我们发现这种格式化存在一个关键缺陷:它会导致我们称之为"多样性坍缩"的现象,即模型针对开放式输入生成语义相似的输出,从而削弱了创造性和变异性。我们通过故事补全和自由生成等任务系统评估了这一效应,发现:(1)即使在高温度采样下,多样性坍缩依然存在;(2)模板中的结构符号显著限制了模型的输出空间。为量化这些发现,我们使用一系列结构化提示对同一模型进行微调,并从三个维度进行评估:下游任务表现、对齐行为和输出多样性。分析表明,微调与推理间的格式一致性对结构敏感型任务(如GSM8K、IFEval)至关重要,但对知识密集型任务(如MMLU、WebQuestions)影响有限。相比之下,输出多样性主要受结构符号存在与否的调控,最小化格式化能产生最多样化的输出。这些发现揭示,当前提示规范虽有利于对齐,却可能无意中抑制输出多样性,这凸显了设计多样性感知的提示模板和指令调优的必要性。
Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments
Abstract
arXiv:2505.18927v1 Announce Type: cross Abstract: As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three leading large language models, OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus, on a corpus of 5,080 YouTube comments sampled from high-abuse threads in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with substantial agreement (Cohen's kappa = 0.83). Using a unified prompt and deterministic settings, GPT-4.1 achieved the best overall balance with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flagged the highest share of harmful posts (recall = 0.875) but its precision fell to 0.767 due to frequent false positives. Claude delivered the highest precision at 0.920 and the lowest false-positive rate of 0.022, yet its recall dropped to 0.720. Qualitative analysis showed that all three models struggle with sarcasm, coded insults, and mixed-language slang. These results underscore the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset and full prompts is publicly released to promote reproducibility and further progress in automated content moderation.
摘要
随着网络平台的发展,评论区日益增多的骚扰行为损害了用户体验和心理健康。本研究以游戏、生活方式、美食视频博客和音乐频道中高攻击性讨论区的5,080条YouTube评论为样本,对OpenAI GPT-4.1、Google Gemini 1.5 Pro和Anthropic Claude 3 Opus三大主流大语言模型进行基准测试。该数据集包含1,334条有害信息和3,746条无害信息,涵盖英语、阿拉伯语和印尼语,经两位评审员独立标注且具有高度一致性(Cohen's kappa = 0.83)。采用统一提示词和确定性参数设置时,GPT-4.1以0.863的F1值、0.887的精确率和0.841的召回率取得最佳综合平衡。Gemini标记有害内容的比例最高(召回率=0.875),但因频繁误报导致精确率降至0.767。Claude以0.920的精确率和0.022的最低误报率表现最优,但其召回率下降至0.720。定性分析表明,三种模型均难以识别讽刺、隐晦侮辱和混合语言俚语。这些结果凸显了建立内容审核管道的必要性:需整合互补模型、结合对话语境,并针对低资源语言和隐性攻击进行优化。本研究公开了匿名化数据集和完整提示词,以促进自动化内容审核研究的可重复性和进一步发展。
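The F1 score quoted in the abstract is the harmonic mean of precision and recall, so it can be recomputed from the reported precision and recall values. The short Python sketch below does that; the function name `f1_score` and the dictionary layout are illustrative choices, and only the GPT-4.1 F1 value is actually stated in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract.
reported = {
    "GPT-4.1": (0.887, 0.841),
    "Gemini 1.5 Pro": (0.767, 0.875),
    "Claude 3 Opus": (0.920, 0.720),
}

for model, (p, r) in reported.items():
    print(f"{model}: F1 = {f1_score(p, r):.3f}")
```

For GPT-4.1 this reproduces the reported 0.863. The values for Gemini (about 0.817) and Claude (about 0.808) are derived here from the reported precision and recall, not quoted from the paper.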
FiLLM -- A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM)
Abstract
arXiv:2505.18995v1 Announce Type: cross Abstract: This study presents FiLLM, a Filipino-optimized large language model, designed to enhance natural language processing (NLP) capabilities in the Filipino language. Built upon the SeaLLM-7B 2.5 model, FiLLM leverages Low-Rank Adaptation (LoRA) fine-tuning to optimize memory efficiency while maintaining task-specific performance. The model was trained and evaluated on diverse Filipino datasets to address key NLP tasks, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Dependency Parsing, and Text Summarization. Performance comparisons with the CalamanCy model were conducted using F1 Score, Precision, Recall, Compression Rate, and Keyword Overlap metrics. Results indicate that CalamanCy outperforms FiLLM in several aspects, demonstrating its effectiveness in processing Filipino text with improved linguistic comprehension and adaptability. This research contributes to the advancement of Filipino NLP applications by providing an optimized, efficient, and scalable language model tailored for local linguistic needs.
摘要
本研究提出FiLLM——一个针对菲律宾语优化的大语言模型,旨在提升菲律宾语自然语言处理(NLP)能力。该模型基于SeaLLM-7B 2.5架构,采用低秩自适应(LoRA)微调技术,在保持任务特定性能的同时优化内存效率。研究通过多样化的菲律宾语数据集对模型进行训练与评估,涵盖命名实体识别(NER)、词性标注(POS)、依存句法分析和文本摘要等核心NLP任务。与CalamanCy模型的性能对比采用F1值、精确率、召回率、压缩率和关键词重叠度等指标。结果表明,CalamanCy在多项指标上优于FiLLM,展现出其在菲律宾语文本处理中更强的语言理解能力与适应性。本研究通过开发针对本土语言需求定制的高效、可扩展优化模型,为菲律宾语NLP应用的发展做出贡献。
An Initial Exploration of Fine-tuning Small Language Models for Smart Contract Reentrancy Vulnerability Detection
Abstract
arXiv:2505.19059v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used for coding tasks, including helping developers identify bugs, and they are a promising avenue for supporting tasks such as vulnerability detection, particularly given the flexibility of such generative AI models and tools. Yet LLMs may not be suitable for many tasks; smaller language models that fit, run, and train easily on a developer's computer may be a better choice. In this paper we explore and evaluate whether smaller language models can be fine-tuned to achieve reasonable results in a niche area, vulnerability detection, focusing specifically on detecting the reentrancy bug in Solidity smart contracts.
摘要
大型语言模型(LLMs)正越来越多地应用于各类编码任务,包括帮助程序员识别代码缺陷,并因其生成式人工智能模型与工具的高度灵活性,成为支持漏洞检测等多样化任务的重要途径。然而对于许多场景而言,使用LLMs可能并不适宜,此时更适合采用能在开发者计算机上轻松部署、训练和执行的较小规模语言模型。本文针对特定领域——智能合约漏洞检测(尤其侧重于Solidity合约中的重入漏洞识别),系统探究并评估了通过微调小型语言模型能否获得有效检测效果。
InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts
Abstract
arXiv:2505.19028v1 Announce Type: cross Abstract: Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at https://github.com/CoolDawnAnt/InfoChartQA.
摘要
理解包含设计驱动视觉元素(如图标、符号)的信息图表需要视觉识别与推理能力,这对多模态大语言模型(MLLMs)提出了挑战。然而,由于缺乏配对的普通图表和基于视觉元素的问题,现有视觉问答基准难以有效评估MLLMs的上述能力。为填补这一空白,我们提出InfoChartQA基准,用于评估MLLMs在信息图表理解上的表现。该基准包含5,642对信息图表与普通图表,每对共享相同底层数据但呈现形式不同。我们进一步设计了基于视觉元素的问题,以捕捉其独特的视觉设计及传达意图。对20个MLLMs的评估表明,模型在信息图表上的性能显著下降,尤其表现在涉及隐喻的视觉元素问题上。配对的图表设计支持细粒度错误分析与消融研究,揭示了提升MLLMs信息图表理解能力的新机遇。项目已发布于https://github.com/CoolDawnAnt/InfoChartQA。
An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Abstract
arXiv:2505.19056v1 Announce Type: cross Abstract: Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that contains harmful prompts with a full response that justifies the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset, and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping at most by 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
摘要
大型语言模型(LLMs)通常通过拒绝有害指令来遵循安全准则。最近出现了一种名为"消融攻击"(abliteration)的攻击方法,该方法通过隔离并抑制导致拒绝行为的最关键潜在方向,使模型能够生成不道德内容。我们提出了一种改进模型拒绝生成机制的防御方案。首先构建了一个扩展拒绝数据集,其中包含有害提示及完整阐述拒绝理由的回应文本。随后基于该数据集对Llama-2-7B-Chat和Qwen2.5-Instruct(15亿和30亿参数)模型进行微调,并在有害提示集上评估改进后的系统。实验表明,扩展拒绝模型仍能保持较高的拒绝率,降幅最多为10%,而基线模型在消融攻击后拒绝率下降70-80%。综合安全性与实用性的评估显示,扩展拒绝微调既能有效抵御消融攻击,又保持了模型的整体性能。
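The abstract describes abliteration as isolating and suppressing a single latent direction. One standard way to suppress a direction in a vector is to subtract its orthogonal projection onto that direction, h' = h - (h · r̂) r̂. The sketch below assumes that this is the operation involved, which the abstract does not spell out; the vectors and the function name `remove_direction` are made up for illustration.

```python
import math


def remove_direction(hidden, direction):
    """Subtract from `hidden` its component along `direction`."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [x / norm for x in direction]
    coeff = sum(h * u for h, u in zip(hidden, unit))   # h . r_hat
    return [h - coeff * u for h, u in zip(hidden, unit)]


hidden = [3.0, 4.0, 0.0]        # toy hidden state
direction = [0.0, 2.0, 0.0]     # toy stand-in for a single latent direction
cleaned = remove_direction(hidden, direction)
print(cleaned)  # [3.0, 0.0, 0.0]
```

Because the operation removes exactly one direction, behaviour that depends on a single direction is fragile under it. This is consistent with the abstract's motivation for training refusals that carry a fuller justification.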
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
Abstract
arXiv:2505.19108v1 Announce Type: cross Abstract: Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.
摘要
研究大语言模型(LLMs)在跨语言与跨模态场景中的幻觉问题,对推动其在实际应用中的大规模部署具有重要意义。然而,现有研究仅局限于单一场景(跨语言或跨模态),尚未探索跨语言与跨模态联合场景下的幻觉现象。为此,我们提出了首个跨语言与跨模态联合幻觉基准(CCHall)以填补这一空白。具体而言,CCHall同时涵盖跨语言和跨模态幻觉场景,可用于评估LLMs的跨语言与跨模态能力。此外,我们对主流开源与闭源LLMs进行了全面评测,实验结果表明当前LLMs在CCHall上仍面临显著挑战。我们希望CCHall能成为评估跨语言与跨模态联合场景下LLMs性能的重要资源。
Medical Large Vision Language Models with Multi-Image Visual Ability
Abstract
arXiv:2505.19031v1 Announce Type: cross Abstract: Medical large vision-language models (LVLMs) have demonstrated promising performance across various single-image question answering (QA) benchmarks, yet their capability in processing multi-image clinical scenarios remains underexplored. Unlike single image based tasks, medical tasks involving multiple images often demand sophisticated visual understanding capabilities, such as temporal reasoning and cross-modal analysis, which are poorly supported by current medical LVLMs. To bridge this critical gap, we present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs that span four types of multi-image visual abilities (temporal understanding, reasoning, comparison, co-reference). Using this dataset, we fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs: MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Additionally, we develop the Med-MIM benchmark to comprehensively evaluate the medical multi-image understanding capabilities of LVLMs. We assess eight popular LVLMs, including our two models, on the Med-MIM benchmark. Experimental results show that both Med-Mantis and MIM-LLaVA-Med achieve superior performance on the held-in and held-out subsets of the Med-MIM benchmark, demonstrating that the Med-MIM instruction dataset effectively enhances LVLMs' multi-image understanding capabilities in the medical domain.
摘要
医学大型视觉语言模型(LVLM)在各种单图像问答(QA)基准测试中展现出优异性能,但其处理多图像临床场景的能力仍待探索。与基于单图像的任务不同,涉及多图像的医学任务通常需要复杂的视觉理解能力(如时序推理和跨模态分析),而当前医学LVLM对此类能力的支持严重不足。为填补这一关键空白,我们提出了Med-MIM指令数据集,包含83.2K个涵盖四种多图像视觉能力(时序理解、推理、比较、共指)的医学多图像问答对。基于该数据集,我们对Mantis和LLaVA-Med进行微调,得到两个专精于多图像分析的医学视觉语言模型:MIM-LLaVA-Med和Med-Mantis。此外,我们开发了Med-MIM基准测试,用于全面评估LVLM的医学多图像理解能力。我们对包括两个新模型在内的八种主流LVLM进行了测试,实验结果表明:Med-Mantis和MIM-LLaVA-Med在Med-MIM基准测试的保留集和外部集上均表现卓越,证实Med-MIM指令数据集能有效提升LVLM在医学领域的多图像理解能力。
FP4 All the Way: Fully Quantized Training of LLMs
Abstract
arXiv:2505.19115v1 Announce Type: cross Abstract: We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way .
摘要
我们首次展示了在多达2000亿标记的数据集上,主要使用4位浮点(FP4)精度对大型语言模型(LLM)进行全量化训练(FQT),涵盖权重、激活值和梯度。我们深入研究了FP4的关键设计选择,包括块大小、缩放格式和舍入方法。分析表明,采用NVFP4格式(即每16个FP4值(E2M1)共享一个E4M3表示的缩放因子)可获得最佳结果。在反向传播和参数更新阶段使用随机舍入,前向传播阶段采用就近舍入以增强稳定性。此外,我们发现量化训练有效性的理论及实证阈值:当梯度范数低于量化噪声约$\sqrt{3}$倍时,量化训练效果会下降。基于这些发现,我们在256个英特尔Gaudi2加速器上成功训练了一个70亿参数模型。该FP4训练模型在下游任务中达到与标准BF16基线相当的性能,证实FP4训练是大规模LLM训练中实用且高效的方法。参考实现详见https://github.com/Anonymous1252022/fp4-all-the-way。
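The block format and the two rounding modes named in the abstract can be illustrated with a small Python sketch. It quantizes one block of 16 values onto the E2M1 magnitude grid with a shared per-block scale, using round-to-nearest or stochastic rounding. Several simplifications are assumptions of this sketch and not taken from the paper or its reference implementation: the scale is kept as an ordinary Python float instead of being encoded in E4M3, the scale is chosen as the block maximum divided by 6, and ties in round-to-nearest go to the smaller magnitude.

```python
import math
import random

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def round_nearest(m):
    """Round a magnitude in [0, 6] to the closest grid point."""
    return min(FP4_GRID, key=lambda g: abs(g - m))


def round_stochastic(m, rng):
    """Round up or down at random so the result equals m in expectation."""
    lo = max(g for g in FP4_GRID if g <= m)
    hi = min(g for g in FP4_GRID if g >= m)
    if lo == hi:
        return lo
    return hi if rng.random() < (m - lo) / (hi - lo) else lo


def quantize_block(block, stochastic=False, rng=None):
    """Quantize one block of 16 values that share a single scale."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("expected a block of 16 values")
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * BLOCK_SIZE
    scale = amax / FP4_GRID[-1]   # kept as a Python float here, not E4M3
    rng = rng or random.Random(0)
    out = []
    for x in block:
        m = min(abs(x) / scale, FP4_GRID[-1])
        q = round_stochastic(m, rng) if stochastic else round_nearest(m)
        out.append(math.copysign(q, x) * scale)
    return out


block = [0.2, 0.7, 1.2, 2.4, 3.6, 5.1, 6.0, -6.0] + [0.0] * 8
forward = quantize_block(block)   # round-to-nearest, as in the forward pass
backward = quantize_block(block, stochastic=True, rng=random.Random(0))
print(forward[:8])
```

Stochastic rounding is unbiased: a value of 2.5 rounds to 2 or 3 with equal probability, so the rounding error averages out over many gradient values. Deterministic round-to-nearest gives repeatable results instead.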
RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models
Abstract
arXiv:2505.19128v1 Announce Type: cross Abstract: The rise of large language models has led to significant performance breakthroughs in named entity recognition (NER) for high-resource languages, yet there remains substantial room for improvement in low- and medium-resource languages. Existing multilingual NER methods face severe language interference during the multi-language adaptation process, manifested in feature conflicts between different languages and the competitive suppression of low-resource language features by high-resource languages. Although training a dedicated model for each language can mitigate such interference, it lacks scalability and incurs excessive computational costs in real-world applications. To address this issue, we propose RetrieveAll, a universal multilingual NER framework based on dynamic LoRA. The framework decouples task-specific features across languages and demonstrates efficient dynamic adaptability. Furthermore, we introduce a cross-granularity knowledge augmented method that fully exploits the intrinsic potential of the data without relying on external resources. By leveraging a hierarchical prompting mechanism to guide knowledge injection, this approach advances the paradigm from "prompt-guided inference" to "prompt-driven learning." Experimental results show that RetrieveAll outperforms existing baselines; on the PAN-X dataset, it achieves an average F1 improvement of 12.1 percent.
摘要
大型语言模型的兴起使得高资源语言在命名实体识别(NER)任务上取得显著性能突破,但中低资源语言仍有较大提升空间。现有多语言NER方法在多语言适配过程中面临严重的语言干扰问题,表现为不同语言间的特征冲突以及高资源语言对低资源语言特征的竞争性抑制。尽管为每种语言训练专用模型可缓解此类干扰,但该方法缺乏可扩展性,且在实际应用中会产生过高计算成本。为解决这一问题,我们提出RetrieveAll——一个基于动态LoRA的通用多语言NER框架。该框架实现了跨语言任务特征的解耦,并展现出高效的动态适应能力。此外,我们提出一种跨粒度知识增强方法,在不依赖外部资源的情况下充分挖掘数据内在潜力。通过采用分层提示机制引导知识注入,该方法将范式从"提示引导推理"推进至"提示驱动学习"。实验结果表明,RetrieveAll优于现有基线模型;在PAN-X数据集上平均F1值提升达12.1%。
SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs
Abstract
arXiv:2505.19163v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various disciplines and tasks. However, benchmarking their capabilities with multilingual spoken queries remains largely unexplored. In this study, we introduce SpokenNativQA, the first multilingual and culturally aligned spoken question-answering (SQA) dataset designed to evaluate LLMs in real-world conversational settings. The dataset comprises approximately 33,000 naturally spoken questions and answers in multiple languages, including low-resource and dialect-rich languages, providing a robust benchmark for assessing LLM performance in speech-based interactions. SpokenNativQA addresses the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. We benchmark different ASR systems and LLMs for SQA and present our findings. We released the data at (https://huggingface.co/datasets/QCRI/SpokenNativQA) and the experimental scripts at (https://llmebench.qcri.org/) for the research community.
摘要
大语言模型(LLMs)已在多学科和任务中展现出卓越性能。然而,针对多语言口语查询的能力基准测试仍存在较大研究空白。本研究推出SpokenNativQA——首个为评估现实对话场景中LLMs表现而设计的、具备多语言与文化对齐特性的口语问答(SQA)数据集。该数据集包含约33,000条自然口语形式的多语言问答对,涵盖资源稀缺和方言丰富的语种,为语音交互场景下的LLM性能评估提供了可靠基准。SpokenNativQA通过纳入语音变异、口音及语言多样性,弥补了文本问答数据集的局限性。我们对不同自动语音识别系统及LLMs进行了SQA基准测试并呈现结果。相关数据(https://huggingface.co/datasets/QCRI/SpokenNativQA)与实验脚本(https://llmebench.qcri.org/)已向研究社区开源。
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Abstract
arXiv:2505.19147v1 Announce Type: cross Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.
摘要
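大语言模型(LLMs)和多模态大语言模型(MLLMs)的快速发展历来依赖以模型为中心的规模扩展,即通过将参数量从数百万增加到数千亿来推动性能提升。然而,随着模型规模逐渐逼近硬件极限,主要的计算瓶颈已从根本上转向长令牌序列上自注意力的二次方开销,而这一开销如今由超长文本上下文、高分辨率图像和长视频所驱动。在本立场论文中,我们主张高效人工智能的研究重心正从模型中心化压缩转向数据中心化压缩。我们将令牌压缩确立为新前沿领域,其通过减少模型训练或推理时的令牌数量来提升AI效率。通过全面分析,首先考察了跨领域长上下文AI的最新进展,建立了现有模型效率策略的统一数学框架,论证了为何令牌压缩是解决长上下文开销的关键范式转变。随后系统梳理了令牌压缩的研究格局,分析其基础优势并揭示其在多元场景中的显著价值。进一步深入探讨了当前令牌压缩研究的核心挑战,并展望了未来发展方向。本研究旨在为AI效率提供新视角,整合现有成果,并推动创新突破以应对日益增长的上下文长度对AI领域发展提出的挑战。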
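The quadratic cost of self-attention that motivates token compression can be made concrete with a rough operation count. The sketch below uses the standard estimate of about 2·n²·d floating-point operations for the QK^T product and another 2·n²·d for the weighted sum over values, for a single head with n tokens and head dimension d. The function name, the sequence lengths, and the head dimension are illustrative assumptions, and softmax and projection costs are ignored.

```python
def attention_flops(num_tokens, head_dim):
    """Approximate FLOPs for one attention head, ignoring softmax and projections."""
    scores = 2 * num_tokens * num_tokens * head_dim     # Q @ K^T
    weighted = 2 * num_tokens * num_tokens * head_dim   # softmax(scores) @ V
    return scores + weighted


full = attention_flops(8192, 128)
compressed = attention_flops(4096, 128)   # half the tokens after compression
print(full / compressed)  # 4.0
```

Halving the number of tokens cuts this attention cost by a factor of four, whereas halving the head dimension would only halve it. That asymmetry is the arithmetic behind treating token count as the main efficiency lever for long contexts.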
OptiMindTune: A Multi-Agent Framework for Intelligent Hyperparameter Optimization
Abstract
arXiv:2505.19205v1 Announce Type: cross Abstract: Hyperparameter optimization (HPO) is a critical yet challenging aspect of machine learning model development, significantly impacting model performance and generalization. Traditional HPO methods often struggle with high dimensionality, complex interdependencies, and computational expense. This paper introduces OptiMindTune, a novel multi-agent framework designed to intelligently and efficiently optimize hyperparameters. OptiMindTune leverages the collaborative intelligence of three specialized AI agents -- a Recommender Agent, an Evaluator Agent, and a Decision Agent -- each powered by Google's Gemini models. These agents address distinct facets of the HPO problem, from model selection and hyperparameter suggestion to robust evaluation and strategic decision-making. By fostering dynamic interactions and knowledge sharing, OptiMindTune aims to converge to optimal hyperparameter configurations more rapidly and robustly than existing single-agent or monolithic approaches. Our framework integrates principles from advanced large language models and adaptive search to achieve scalable and intelligent AutoML. We posit that this multi-agent paradigm offers a promising avenue for tackling the increasing complexity of modern machine learning model tuning.
摘要
超参数优化(HPO)是机器学习模型开发中关键但具有挑战性的环节,对模型性能与泛化能力影响显著。传统HPO方法常受限于高维度、复杂参数关联及高昂计算成本。本文提出OptiMindTune——一种新型多智能体框架,旨在智能高效地优化超参数。该框架利用三个由Google Gemini模型驱动的专业AI智能体(推荐智能体、评估智能体与决策智能体)的协同智能,分别处理HPO问题的不同层面,包括模型选择、超参数建议、鲁棒性评估及策略决策。通过促进动态交互与知识共享,OptiMindTune相比现有单智能体或整体式方法能以更快速度、更强鲁棒性收敛至最优超参数配置。本框架融合了先进大语言模型与自适应搜索原理,实现可扩展的智能自动化机器学习。我们认为这种多智能体范式为解决现代机器学习模型调参日益增长的复杂性提供了可行路径。
POQD: Performance-Oriented Query Decomposer for Multi-vector retrieval
Abstract
arXiv:2505.19189v1 Announce Type: cross Abstract: Although Multi-Vector Retrieval (MVR) has achieved the state of the art on many information retrieval (IR) tasks, its performance depends heavily on how queries are decomposed into smaller pieces, such as phrases or tokens. However, optimizing query decomposition for MVR performance is not end-to-end differentiable. Even worse, jointly solving this problem and training the downstream retrieval-based systems, such as RAG systems, could be highly inefficient. To overcome these challenges, we propose Performance-Oriented Query Decomposer (POQD), a novel query decomposition framework for MVR. POQD leverages one LLM for query decomposition and searches for the optimal prompt with an LLM-based optimizer. We further propose an end-to-end training algorithm to alternately optimize the prompt for query decomposition and the downstream models. This algorithm can achieve superior MVR performance at a reasonable training cost, as our theoretical analysis suggests. POQD can be integrated seamlessly into arbitrary retrieval-based systems such as Retrieval-Augmented Generation (RAG) systems. Extensive empirical studies on representative RAG-based QA tasks show that POQD outperforms existing query decomposition strategies in both retrieval performance and end-to-end QA accuracy. POQD is available at https://github.com/PKU-SDS-lab/POQD-ICML25.
摘要
尽管多向量检索(MVR)在许多信息检索(IR)任务中达到了最先进的性能,但其效果高度依赖于如何将查询分解为更小的片段(如短语或词元)。然而,为优化MVR性能而进行的查询分解并非端到端可微分。更严重的是,将该问题与下游基于检索的系统(如RAG系统)联合训练时效率极低。为克服这些挑战,我们提出面向性能的查询分解器(POQD)——一种新型的MVR查询分解框架。POQD利用一个大语言模型(LLM)进行查询分解,并通过基于LLM的优化器搜索最优提示。我们进一步提出一种端到端训练算法,交替优化查询分解提示与下游模型。理论分析表明,该算法能以合理训练成本实现卓越的MVR性能。POQD可无缝集成至任意基于检索的系统(如检索增强生成系统)。在典型RAG问答任务上的大量实验表明,POQD在检索性能和端到端问答准确率上均优于现有查询分解策略。POQD代码已开源:https://github.com/PKU-SDS-lab/POQD-ICML25。
To CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers
Abstract
arXiv:2505.19245v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) and Looped Transformers have been shown to empirically improve performance on reasoning tasks and to theoretically enhance expressivity by recursively increasing the number of computational steps. However, their comparative capabilities are still not well understood. In this paper, we provide a formal analysis of their respective strengths and limitations. We show that Looped Transformers can efficiently simulate parallel computations for deterministic tasks, which we formalize as evaluation over directed acyclic graphs. In contrast, CoT with stochastic decoding excels at approximate inference for compositional structures, namely self-reducible problems. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical cues for choosing between reasoning paradigms.
摘要
思维链(CoT)与循环变压器已被实证证明能提升推理任务表现,并在理论上通过递归增加计算步数来增强表达能力。然而,二者的相对能力仍未被充分理解。本文对其各自优势与局限进行了形式化分析:循环变压器可高效模拟确定性任务(形式化为有向无环图求值)的并行计算,而采用随机解码的CoT则擅长组合结构(即自可归约问题)的近似推理。这些差异揭示了深度驱动递归更适用的任务类型,从而为推理范式的选择提供了实践依据。
LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
Abstract
arXiv:2505.19187v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.
摘要
大型语言模型(LLMs)通过测试时扩展方法展现出卓越的推理能力,尤其是在使用从更强大的大型推理模型(LRMs)中提炼的思维链(CoT)数据进行微调时。然而,这些推理链通常包含反映人类问题解决过程的冗余元素,可分为渐进式推理(核心解决方案的构建路径)和功能性元素(验证过程、替代解法及错误修正)。虽然渐进式推理至关重要,但功能性元素会显著增加测试时推理的计算负担。我们提出PIR(基于困惑度的重要性优化框架),该原则性框架通过量化评估每个推理步骤对答案预测置信度的影响来确定其重要性。PIR系统性地识别并选择性剪枝低重要性功能步骤,同时保留渐进式推理成分,从而生成保持核心解决路径完整性且减少冗余的优化训练数据。基于PIR优化数据微调的模型展现出更优的测试时扩展特性:在AIME、AMC和GPQA Diamond等具有挑战性的推理基准测试中,模型生成的推理链更简洁,准确率提升(+0.9%至+6.6%),同时显著降低token使用量(-3%至-41%)。该方法在不同模型规模、数据源和token预算条件下均表现出强泛化能力,为在测试时扩展效率、响应时间和计算效率受限场景中部署具备推理能力的LLMs提供了实用解决方案。
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
Abstract
arXiv:2505.19241v1 Announce Type: cross Abstract: The recent success of using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks like question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions (e.g., linearity). To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model that is used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of the LLM on data selection, unlike methods that select the data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Extensive experiments show that ActiveDPO outperforms existing methods across various models and datasets.
摘要
近期利用人类偏好对齐大型语言模型(LLMs)的成功显著提升了其在问答、数学推理和代码生成等下游任务中的表现。然而,实现有效的LLM对齐依赖于高质量的人类偏好数据集。收集这些数据需要进行人工偏好标注,成本高昂且资源密集,因此需要高效的数据主动选择方法。现有方法要么缺乏坚实的理论基础,要么依赖于严格的奖励函数假设(如线性)。为此,我们提出了一种算法ActiveDPO,该算法基于理论依据为非线性的奖励函数设计数据选择标准,并直接利用LLM本身参数化用于主动数据选择的奖励模型。与不考虑待对齐LLM影响的数据选择方法不同,ActiveDPO显式地考虑了LLM对数据选择的影响,从而实现更高效的数据收集。大量实验表明,ActiveDPO在不同模型和数据集上均优于现有方法。
Two LLMs debate, both are certain they've won
Abstract
arXiv:2505.19184v1 Announce Type: cross Abstract: Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLM outputs are deployed without careful review in assistant roles or agentic settings.
摘要
大型语言模型能否在面对反对意见时准确调整其置信度?基于先前针对静态事实问答任务校准度的研究,我们在动态对抗性辩论场景中评估了大语言模型(LLMs),该设置独特地结合了两个现实因素:(a)需要模型根据新出现信息更新信念的多轮对话形式;(b)用于控制任务相关不确定性的零和结构——因为双方同时高置信度的主张意味着系统性过度自信。我们组织了十种前沿LLM参与的60场三轮政策辩论,模型在每轮结束后私下评估其获胜置信度(0-100)。观察到五个值得关注的现象:(1)系统性过度自信:模型初始平均置信度为72.9%,而理性基线应为50%;(2)置信度升级:随着辩论推进,辩手反而提高获胜概率,最终轮平均达83%;(3)相互高估:61.7%的辩论中出现双方同时宣称≥75%胜率的逻辑矛盾;(4)持续性自我辩论偏差:与相同副本辩论时,模型置信度从64.1%升至75.2%;即使明确告知胜率应为50%,置信度仍从50.0%上升至57.1%;(5)非对齐的私有推理:模型的私有推理过程有时与其公开置信度评级不一致,引发对思维链推理可信度的担忧。这些结果表明LLMs在动态多轮任务中缺乏准确自我评估或更新信念的能力,当LLM输出被未经审慎核查地部署于助手角色或自主场景时,将构成重大隐患。
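The zero-sum argument in the abstract above can be made concrete with a little arithmetic: if exactly one side wins, the two win probabilities must sum to 100, so two ratings of 75 or more overshoot by at least 50 points. The sketch below is a minimal illustration under that assumption; the function names and the sample ratings are hypothetical and are not data from the paper.

```python
def confidence_excess(conf_a: float, conf_b: float) -> float:
    """Points by which two stated win confidences (0-100) exceed a coherent total of 100."""
    return conf_a + conf_b - 100.0


def mutual_overestimation(conf_a: float, conf_b: float, threshold: float = 75.0) -> bool:
    """True when both debaters claim at least `threshold` percent chance of winning."""
    return conf_a >= threshold and conf_b >= threshold


# Hypothetical final-round ratings for four debates (side A, side B).
debates = [(80, 78), (60, 45), (75, 90), (55, 70)]

flagged = [pair for pair in debates if mutual_overestimation(*pair)]
rate = len(flagged) / len(debates)

print(flagged)                    # [(80, 78), (75, 90)]
print(rate)                       # 0.5
print(confidence_excess(80, 78))  # 58.0
```

The excess is zero for a coherent pair such as (50, 50) and grows as both sides become more certain, which gives a simple per-debate measure of joint overconfidence.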
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas
Abstract
arXiv:2505.19212v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs' moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial variation across models in both their general tendency to act morally and the consistency of their behavior across game types, the specific moral framing, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent's "self-interest" may conflict with ethical expectations. Our code is available at https://github.com/sbackmann/moralsim.
摘要
大语言模型(LLMs)的最新进展使其能够承担复杂的代理角色,涉及与人类或其他代理的决策过程,这使得伦理对齐成为人工智能安全的关键问题。尽管先前研究已考察过LLMs在社会困境中的道德判断和策略行为,但对其在道德要求与利益激励直接冲突时的行为机制仍缺乏深入理解。为此,我们开发了'社会困境模拟中的道德行为'(MoralSim)系统,通过囚徒困境和公共物品博弈的道德情境设置来评估LLMs的行为模式。在MoralSim中,我们测试了多种前沿模型在两种博弈结构和三种不同道德框架下的表现,从而系统性地研究LLMs如何在伦理规范与收益最大化策略相冲突的社会困境中做出抉择。研究结果显示,不同模型在道德行为总体倾向性方面存在显著差异,其行为一致性也随博弈类型、特定道德框架以及对手行为、生存风险等情境因素而变化。关键发现是:所有模型在MoralSim中均未表现出持续稳定的道德行为,这警示我们在LLMs可能面临'自身利益'与伦理期望冲突的代理角色部署中需保持审慎。代码开源地址:https://github.com/sbackmann/moralsim。
LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models
Abstract
arXiv:2505.19240v1 Announce Type: cross Abstract: Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations such as failures in reasoning, hallucinations, and limited multilingual capability. In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to 2024 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification, validated against expert labels, and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that LLM-related research increases over fivefold in ACL and fourfold in arXiv. Since 2022, LLLMs research grows even faster, reaching over 30% of LLM papers by late 2024. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2024. We release a dataset of annotated abstracts and a validated methodology, and offer a quantitative view of trends in LLM limitations research.
摘要
随着大语言模型(LLM)研究的快速发展,人们对其局限性(如推理失败、幻觉问题及多语言能力不足)的关注也日益增加。本综述采用自下而上的方法,对2022至2024年间关于LLM局限性(LLLMs)的研究进行了数据驱动的半自动化回顾。通过从25万篇ACL和arXiv论文中筛选,我们结合关键词过滤、基于LLM的分类(经专家标注验证)以及主题聚类(采用HDBSCAN+BERTopic与LlooM两种方法),最终确定了14,648篇相关文献。研究发现:ACL中LLM相关研究增长超五倍,arXiv中增长四倍;自2022年起,LLLMs研究增速更快,至2024年末已占LLM论文的30%以上。推理仍是研究最多的局限领域,其次为泛化性、幻觉、偏见和安全性。ACL数据集的主题分布相对稳定,而arXiv在2022至2024年间转向安全可控性(如安全风险、对齐、幻觉、知识编辑等主题)与多模态研究。我们公开了标注摘要数据集及验证方法,为LLM局限性研究趋势提供了量化视角。
MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search
Abstract
arXiv:2505.19209v1 Announce Type: cross Abstract: Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.
摘要
大语言模型(LLMs)在自动化科学假设生成方面展现出潜力,但现有方法主要产生粗粒度的假设,缺乏关键的方法论和实验细节。我们提出并正式定义了细粒度科学假设发现这一新任务,其目标是从初始的粗粒度研究方向生成详细且可实验操作的假设。我们将此任务构建为一个组合优化问题,并探究在最大限度利用LLMs时其解决该问题的能力上限。具体而言,我们探讨了四个基础问题:(1)如何最佳利用LLM的内部启发式方法,使其生成自身基于内部评分认为最有潜力的细粒度假设——从而在假设空间上定义一个潜在的奖励景观;(2)此类由LLM判定为更优的假设是否与真实假设表现出更强的一致性;(3)使用一组能力相近的多样化LLM塑造奖励景观,是否比使用其中最强LLM的重复实例定义奖励景观能产生更好的结果;(4)一组相同的LLM是否比单个LLM提供更可靠的奖励景观。为解决这些问题,我们提出了一种分层搜索方法,该方法从一般概念逐步推进到具体实验配置,逐步提出并将细节整合到假设中。我们证明这一分层过程能够平滑奖励景观并实现更有效的优化。在基于近期化学文献中专家标注的细粒度假设新基准上的实证评估表明,我们的方法始终优于强基线模型。
VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
Abstract
arXiv:2505.19255v1 Announce Type: cross Abstract: Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools.
摘要
强化学习微调(RFT)通过实现长链思维、自我修正和有效工具使用,显著提升了大型语言模型(LLMs)的推理能力。尽管近期研究尝试将RFT扩展至视觉语言模型(VLMs),但这些工作主要生成基于静态图像输入的纯文本推理,未能实现响应中真正的多模态推理。相比之下,像Visual Sketchpad这样的测试时方法虽然包含视觉步骤,但缺乏训练机制。 我们提出VTool-R1——首个通过交错文本与中间视觉推理步骤来训练VLMs生成多模态思维链的框架。VTool-R1将基于Python的视觉编辑工具集成到RFT流程中,使VLMs能学习何时及如何生成有益于最终推理的视觉推理步骤。通过绑定任务准确度的结果导向奖励进行训练,我们的方法无需依赖过程监督即可激发策略性视觉工具使用以支持推理。在图表结构化视觉问答任务上的实验表明,VTool-R1通过教导VLMs"用图像思考"并生成基于工具的多模态思维链,显著提升了推理性能。
Towards Large Reasoning Models for Agriculture
Abstract
arXiv:2505.19259v1 Announce Type: cross Abstract: Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short in navigating this nuanced problem due to limited reasoning capacity. We hypothesize that recent advances in large reasoning models (LRMs) can better handle such structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark with 100 questions for agricultural reasoning. Evaluations across thirteen open-source and proprietary models reveal that LRMs outperform conventional ones, though notable challenges persist, with the strongest Gemini-based baseline achieving 36% accuracy. We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and equipped with synthetically generated reasoning traces. Using AgThoughts, we develop AgThinker, a suite of small reasoning models that can be run on consumer-grade GPUs, and show that our dataset can be effective in unlocking agricultural reasoning abilities in LLMs. Our project page is here: https://baskargroup.github.io/Ag_reasoning/
摘要
农业决策涉及复杂且情境特定的推理过程,作物选择、实践措施及干预方案的制定高度依赖于地理、气候和经济条件。传统大语言模型(LLMs)由于推理能力有限,往往难以应对这种具有细微差异的问题。我们假设近期发展的大规模推理模型(LRMs)能更好地处理此类结构化、领域特定的推断任务。为验证该假设,我们推出AgReason——首个由专家构建的开放式科学基准测试,包含100道农业推理问题。通过对13个开源和专有模型的评估发现,尽管仍存在显著挑战,LRMs表现优于传统模型,其中基于Gemini的最强基线模型准确率达到36%。我们还提出AgThoughts数据集,该大规模数据集包含44.6K个经人工监督生成的问答对,并配备合成生成的推理轨迹。利用AgThoughts,我们开发了可在消费级GPU上运行的轻量级推理模型套件AgThinker,证明该数据集能有效激发LLMs的农业推理能力。项目页面详见:https://baskargroup.github.io/Ag_reasoning/
Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning
Abstract
arXiv:2505.19261v1 Announce Type: cross Abstract: Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner. Specifically, DiT-ST leverages Large Language Models to parse captions, extracting diverse primitives and hierarchically sorting out and constructing these primitives into a split-text input. Moreover, we partition the diffusion denoising process according to its differential sensitivities to diverse semantic primitive types and determine the appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens via cross-attention. In this way, DiT-ST enhances the representation learning of specific semantic primitive types across different stages. Extensive experiments validate the effectiveness of our proposed DiT-ST in mitigating the complete-text comprehension defect.
摘要
当前文本到图像的扩散生成通常采用完整文本条件输入。由于复杂语法结构,扩散变换器(DiTs)固有地存在对完整文本描述的理解缺陷:一次性完整文本输入要么忽略关键语义细节,要么因同时建模多种语义基元类型而导致语义混淆。为缓解DiTs的这一缺陷,我们提出名为DiT-ST的新型拆分文本条件框架。该框架将完整文本描述转换为由简化句子组成的拆分文本描述,以显式表达各类语义基元及其相互关系。随后通过分层渐进方式,将拆分文本描述注入DiT-ST的不同去噪阶段。具体而言,DiT-ST利用大语言模型解析描述文本,提取多样化语义基元,并分层梳理构建为拆分文本输入。此外,我们根据扩散去噪过程对不同语义基元类型的差异敏感性进行阶段划分,确定合适时间步,通过交叉注意力机制将各类语义基元的标记逐步注入输入标记。这种方法增强了DiT-ST在不同阶段对特定语义基元类型的表征学习能力。大量实验验证了所提DiT-ST在缓解完整文本理解缺陷方面的有效性。
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Abstract
arXiv:2505.19293v1 Announce Type: cross Abstract: Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.
摘要
长上下文能力被视为大语言模型(LLM)最重要的能力之一,因为真正具备长上下文处理能力的LLM能让用户轻松完成许多原本繁琐的任务——例如通过消化长文本文档来寻找答案,而非直接向LLM提问。然而,现有基于真实任务的长上下文评估基准存在两大缺陷。首先,像LongBench这样的基准通常无法提供合适的指标来区分长上下文性能与模型的基线能力,导致跨模型比较不够清晰。其次,这类基准通常以固定输入长度构建,限制了其在不同模型间的适用性,且无法揭示模型何时开始失效。为解决这些问题,我们提出了一个长度可调的长上下文基准和一个新颖的评估指标,该指标能将基线知识与真实的长上下文能力分离。实验证明我们的方法在有效评估LLM方面具有优越性。
A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations
Abstract
arXiv:2505.19299v1 Announce Type: cross Abstract: Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging for language models to generate and for humans to assess. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a prediction, serving as an important aspect of explanation faithfulness. Our analysis reveals that more than 62% of explanations generated by large language models lack this consistency. We show that applying direct preference optimization improves the consistency of generated explanations across three model families, with improvement ranging from 43.1% to 292.3%. Furthermore, we demonstrate that optimizing this consistency measure can improve explanation faithfulness by up to 9.7%.
摘要
可靠的自由文本解释对于确保高风险人工智能决策场景的透明度至关重要,但语言模型生成这类解释以及人工评估均存在挑战。本文通过扩展证据权重概念,提出了一种预测-解释(PEX)一致性度量方法。该指标量化了自由文本解释对预测的支持或反对程度,是解释可信度的重要维度。分析表明,超过62%由大语言模型生成的解释缺乏这种一致性。研究发现,采用直接偏好优化方法可提升三个模型族生成解释的一致性,改进幅度介于43.1%至292.3%之间。此外,优化该一致性指标能使解释可信度最高提升9.7%。
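For readers unfamiliar with the weight-of-evidence concept that the abstract above builds on, the classical definition for a hypothesis h and evidence e is given below. This is the standard textbook form only; how the paper extends it to free-text explanations is not stated in the abstract, so reading h as the model's prediction and e as the explanation is an assumption made here for illustration.

```latex
\[
\operatorname{woe}(h : e)
  \;=\; \log \frac{P(e \mid h)}{P(e \mid \neg h)}
  \;=\; \log \frac{P(h \mid e)}{P(\neg h \mid e)} \;-\; \log \frac{P(h)}{P(\neg h)}
\]
```

A positive value means the evidence raises the odds of h (the explanation supports the prediction), a negative value means it lowers them, and zero means the evidence is uninformative.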
Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking
Abstract
arXiv:2505.19310v1 Announce Type: cross Abstract: Integrating multiple (sub-)systems is essential to create advanced Information Systems. Difficulties mainly arise when integrating dynamic environments, e.g., the integration at design time of not yet existing services. This has been traditionally addressed using a registry that provides the API documentation of the endpoints. Large Language Models have been shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. In the present work, we (i) analyze the usage of Retrieval Augmented Generation for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using (iii) a proposed novel service discovery benchmark SOCBench-D representing a general setting across numerous domains and the real-world RestBench benchmark, first, for the different chunking possibilities and parameters measuring the endpoint retrieval accuracy. Then, we assess the Discovery Agent using the same test data set. The prototype shows how to successfully employ RAG for endpoint discovery to reduce the token count. Our experiments show that endpoint-based approaches outperform naive chunking methods for preprocessing. Relying on an agent significantly improves precision while being prone to decrease recall, disclosing the need for further reasoning capabilities.
摘要
整合多个(子)系统对于构建高级信息系统至关重要。当涉及动态环境集成时(例如在设计阶段集成尚未存在的服务),主要困难随之产生。传统解决方案依赖于提供终端API文档的注册中心。大型语言模型已展现出基于此类文档自动实现系统集成(如服务组合)的能力,但由于输入令牌限制(尤其针对综合性API描述),需要精简的输入内容。目前,关于如何最优预处理这些API描述尚未形成共识。本研究(i)分析了检索增强生成技术在终端发现中的应用,以及对现行OpenAPI进行分块预处理的方法,旨在缩减输入令牌长度的同时保留最关键信息。为进一步减少组合提示的输入令牌长度并提升终端检索效率,我们提出(ii)发现代理机制——该代理仅接收最相关终端的摘要,并按需获取详细规范说明。我们通过(iii)新开发的多领域通用服务发现基准SOCBench-D和真实场景RestBench基准,首先针对不同分块方案及参数评估终端检索准确率,继而使用相同测试数据集验证发现代理性能。原型系统证明了检索增强生成技术能有效降低终端发现的令牌消耗。实验表明:基于终端的预处理方法优于简单分块策略,而代理机制在显著提升精度的同时可能降低召回率,这揭示了对进一步推理能力的需求。
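To make the idea of endpoint-based chunking in the abstract above more tangible, the sketch below splits an OpenAPI document into one chunk per path-and-method pair, so that a retriever can index each endpoint separately. It is a minimal sketch, not the paper's pipeline: the function name `chunk_by_endpoint`, the chunk layout, and the sample specification are hypothetical, and only the `paths` structure and HTTP method keys come from the OpenAPI format itself.

```python
# Operation keys allowed on an OpenAPI path item; other keys such as
# "parameters" or "summary" describe the path as a whole and are skipped.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def chunk_by_endpoint(spec: dict) -> list[dict]:
    """Return one retrievable text chunk per (method, path) operation."""
    chunks = []
    for path, path_item in spec.get("paths", {}).items():
        for method, operation in path_item.items():
            if method.lower() not in HTTP_METHODS:
                continue
            endpoint = f"{method.upper()} {path}"
            parts = [endpoint, operation.get("summary", ""), operation.get("description", "")]
            text = "\n".join(p for p in parts if p)  # drop empty fields
            chunks.append({"id": endpoint, "text": text})
    return chunks


# Hypothetical specification used only to exercise the function.
spec = {
    "openapi": "3.0.0",
    "paths": {
        "/pets": {
            "parameters": [],
            "get": {"summary": "List pets"},
            "post": {"summary": "Create a pet", "description": "Adds a pet to the store."},
        },
        "/pets/{id}": {"get": {"summary": "Get one pet"}},
    },
}

chunks = chunk_by_endpoint(spec)
for chunk in chunks:
    print(chunk["id"])
```

Each chunk stays small and self-describing, in contrast to fixed-size splitting, which can cut an endpoint description in the middle.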
Communication-Efficient Multi-Device Inference Acceleration for Transformer Models
Abstract
arXiv:2505.19342v1 Announce Type: cross Abstract: Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. Multi-device inference can reduce latency by parallelizing computation. Yet, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA compresses non-local token embeddings via vector quantization and preserves task accuracy through two optimizations, Noise-Augmented Quantization and Distributed Class Tokens. Experiments on ViT and GPT2 across vision and NLP tasks show that ASTRA achieves up to 2.64X speedups over single-device inference and up to 15.25X speedups over state-of-the-art multi-device inferences, while operating under bandwidths as low as 10 Mbps. ASTRA is open-sourced at https://github.com/xl1990/Astra.
摘要
Transformer模型虽驱动众多AI应用,但其高推理延迟限制了实时场景下的使用。多设备推理可通过并行计算降低延迟,然而现有方法需要高设备间带宽,在带宽受限环境中难以实用。我们提出ASTRA框架,该通信高效方案通过序列并行与混合精度注意力机制创新性结合来加速Transformer推理,旨在最小化设备间通信。ASTRA采用向量量化压缩非局部令牌嵌入,并通过噪声增强量化和分布式类别令牌两项优化保持任务精度。在视觉与NLP任务中对ViT和GPT2的实验表明,ASTRA相比单设备推理最高实现2.64倍加速,较先进多设备推理方案最高达15.25倍加速,且可在低至10 Mbps带宽下运行。ASTRA已开源:https://github.com/xl1990/Astra。
Simple and Effective Baselines for Code Summarisation Evaluation
Abstract
arXiv:2505.19392v1 Announce Type: cross Abstract: Code documentation is useful, but writing it is time-consuming. Different techniques for generating code summaries have emerged, but comparing them is difficult because human evaluation is expensive and automatic metrics are unreliable. In this paper, we introduce a simple new baseline in which we ask an LLM to give an overall score to a summary. Unlike n-gram and embedding-based baselines, our approach is able to consider the code when giving a score. This allows us to also make a variant that does not consider the reference summary at all, which could be used for other tasks, e.g., to evaluate the quality of documentation in code bases. We find that our method is as good as or better than prior metrics, though we recommend using it in conjunction with embedding-based methods to avoid the risk of LLM-specific bias.
摘要
代码文档具有实用价值,但编写过程耗时。尽管已出现多种生成代码摘要的技术,但由于人工评估成本高昂且自动度量指标不可靠,比较这些技术存在困难。本文提出一种简单的新基线方法:通过大型语言模型(LLM)对摘要进行整体评分。与基于n-gram和嵌入的基线方法不同,我们的方法能在评分时考虑代码本身。这使得我们可以构建完全不参考原始摘要的变体,该方法还可应用于其他任务(例如评估代码库中文档的质量)。研究发现,尽管建议与基于嵌入的方法结合使用以避免LLM特定偏差的风险,但本方法优于或等同于现有度量指标。
"It's Not Just Labeling" -- A Research on LLM Generated Feedback Interpretability and Image Labeling Sketch Features
Abstract
arXiv:2505.19419v1 Announce Type: cross Abstract: The quality of training data is critical to the performance of machine learning applications in domains like transportation, healthcare, and robotics. Accurate image labeling, however, often relies on time-consuming, expert-driven methods with limited feedback. This research introduces a sketch-based annotation approach supported by large language models (LLMs) to reduce technical barriers and enhance accessibility. Using a synthetic dataset, we examine how sketch recognition features relate to LLM feedback metrics, aiming to improve the reliability and interpretability of LLM-assisted labeling. We also explore how prompting strategies and sketch variations influence feedback quality. Our main contribution is a sketch-based virtual assistant that simplifies annotation for non-experts and advances LLM-driven labeling tools in terms of scalability, accessibility, and explainability.
摘要
训练数据的质量对机器学习在交通、医疗和机器人等领域的应用性能至关重要。然而,准确的图像标注通常依赖于耗时且反馈有限的专家驱动方法。本研究提出了一种基于草图标注的方法,并借助大语言模型(LLMs)降低技术门槛、提升可及性。通过使用合成数据集,我们探究了草图识别特征与大语言模型反馈指标之间的关联,旨在提高LLM辅助标注的可靠性和可解释性。同时,我们还研究了提示策略和草图变体对反馈质量的影响。我们的主要贡献是开发了一个基于草图的虚拟助手,该工具不仅简化了非专业人士的标注流程,还在可扩展性、可访问性和可解释性方面推动了LLM驱动的标注工具发展。
Alignment of large language models with constrained learning
Abstract
arXiv:2505.19387v1 Announce Type: cross Abstract: We study the problem of computing an optimal large language model (LLM) policy for a constrained alignment problem, where the goal is to maximize a primary reward objective while satisfying constraints on secondary utilities. Despite the popularity of Lagrangian-based LLM policy search in constrained alignment, iterative primal-dual methods often fail to converge, and non-iterative dual-based methods do not achieve optimality in the LLM parameter space. To address these challenges, we employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the LLM policy via Lagrangian maximization and updating the dual variable via dual descent. In theory, we characterize the primal-dual gap between the primal value in the distribution space and the dual value in the LLM parameter space. We further quantify the optimality gap of the learned LLM policies at near-optimal dual variables with respect to both the objective and the constraint functions. These results prove that dual-based alignment methods can find an optimal constrained LLM policy, up to an LLM parametrization gap. We demonstrate the effectiveness and merits of our approach through extensive experiments conducted on the PKU-SafeRLHF dataset.
摘要
我们研究如何为受限对齐问题计算最优大语言模型(LLM)策略,其目标是在满足次要效用约束条件下最大化主要奖励目标。尽管基于拉格朗日方法的LLM策略搜索在受限对齐中广受欢迎,但迭代的原始对偶方法常无法收敛,而非迭代的对偶方法在LLM参数空间中无法达到最优性。为解决这些挑战,我们运用拉格朗日对偶性开发了一种迭代对偶对齐方法,该方法通过在拉格朗日最大化更新LLM策略与对偶下降更新对偶变量之间交替进行。理论上,我们刻画了分布空间中的原始值与LLM参数空间中对偶值之间的原始对偶间隙。我们进一步量化了在近优对偶变量下所学LLM策略关于目标函数和约束函数的最优性间隙。这些结果证明对偶对齐方法能找到最优受限LLM策略(直至LLM参数化间隙)。通过在PKU-SafeRLHF数据集上的大量实验,我们验证了所提方法的有效性和优势。
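The alternation described in this abstract (a Lagrangian maximization step for the policy, then a dual descent step for the multiplier) can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a finite set of candidate responses, a KL-regularized objective with a reference policy, and a single constraint, so the Lagrangian maximizer has a closed form. All function and parameter names are hypothetical.

```python
import math

def lagrangian_maximizer(ref, reward, utility, lam, beta):
    """Closed-form maximizer of the KL-regularized Lagrangian over a finite
    response set: policy(y) is proportional to ref(y) * exp((r(y) + lam * g(y)) / beta)."""
    logits = [math.log(p) + (r + lam * g) / beta
              for p, r, g in zip(ref, reward, utility)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def dual_descent(ref, reward, utility, threshold, beta=1.0, step=0.5, iters=300):
    """Alternate policy updates with projected gradient descent on the dual variable.

    The constraint is E_policy[utility] >= threshold; lam is kept non-negative.
    """
    lam = 0.0
    for _ in range(iters):
        policy = lagrangian_maximizer(ref, reward, utility, lam, beta)
        # Gradient of the dual function with respect to lam is the constraint slack.
        slack = sum(p * g for p, g in zip(policy, utility)) - threshold
        lam = max(0.0, lam - step * slack)
    return lagrangian_maximizer(ref, reward, utility, lam, beta), lam

if __name__ == "__main__":
    ref = [1 / 3, 1 / 3, 1 / 3]     # hypothetical reference policy
    reward = [1.0, 0.5, 0.0]        # primary reward per response
    utility = [0.0, 0.5, 1.0]       # secondary utility per response
    policy, lam = dual_descent(ref, reward, utility, threshold=0.6)
    print(policy, lam)
```

When the constraint is inactive at the unconstrained optimum, the multiplier stays at zero and the policy reduces to the reward-only solution.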
PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims
Abstract
arXiv:2505.19345v1 Announce Type: cross Abstract: Natural language generation (NLG) metrics play a central role in evaluating generated texts, but are not well suited for the structural and legal characteristics of patent documents. Large language models (LLMs) offer strong potential in automating patent generation, yet research on evaluating LLM-generated patents remains limited, especially in evaluating the generation quality of patent claims, which are central to defining the scope of protection. Effective claim evaluation requires addressing legal validity, technical accuracy, and structural compliance. To address this gap, we introduce PatentScore, a multi-dimensional evaluation framework for assessing LLM-generated patent claims. PatentScore incorporates: (1) hierarchical decomposition for claim analysis; (2) domain-specific validation patterns based on legal and technical standards; and (3) scoring across structural, semantic, and legal dimensions. Unlike general-purpose NLG metrics, PatentScore reflects patent-specific constraints and document structures, enabling evaluation beyond surface similarity. We evaluate 400 GPT-4o-mini generated Claim 1s and report a Pearson correlation of with expert annotations, outperforming existing NLG metrics. Furthermore, we conduct additional evaluations using open models such as Claude-3.5-Haiku and Gemini-1.5-flash, all of which show strong correlations with expert judgments, confirming the robustness and generalizability of our framework.
摘要
自然语言生成(NLG)指标在评估生成文本方面发挥着核心作用,但并不适合专利文件的结构和法律特征。大语言模型(LLM)在自动化专利生成方面展现出强大潜力,然而针对LLM生成专利的评估研究仍然有限,特别是在评估权利要求书的生成质量方面——这是界定保护范围的核心要素。有效的权利要求评估需要解决法律有效性、技术准确性和结构合规性等问题。为填补这一空白,我们提出了PatentScore,一个用于评估LLM生成专利权利要求的多维框架。PatentScore包含:(1)权利要求分析的层次化分解;(2)基于法律和技术标准的领域特定验证模式;(3)结构、语义和法律维度的评分体系。与通用NLG指标不同,PatentScore反映了专利特有的约束条件和 文档结构,能够实现超越表面相似性的深度评估。我们评估了400份GPT-4o-mini生成的权利要求1,报告其与专家标注的Pearson相关性达,优于现有NLG指标。此外,我们还使用Claude-3.5-Haiku和Gemini-1.5-flash等开源模型进行了补充评估,所有结果均显示与专家判断具有强相关性,证实了我们框架的稳健性和普适性。
The Role of Diversity in In-Context Learning for Large Language Models
Abstract
arXiv:2505.19426v1 Announce Type: cross Abstract: In-context learning (ICL) is a crucial capability of current large language models (LLMs), where the selection of examples plays a key role in performance. While most existing approaches focus on selecting the most similar examples to the query, the impact of diversity in example selection remains underexplored. We systematically investigate the role of diversity in in-context example selection through experiments across a range of tasks, from sentiment classification to more challenging math and code problems. Experiments on Llama-3.1, Gemma-2, and Mistral-v0.3 families of models show that diversity-aware selection methods improve performance, particularly on complex tasks like math and code, and enhance robustness to out-of-distribution queries. To support these findings, we introduce a theoretical framework that explains the benefits of incorporating diversity in in-context example selection.
摘要
上下文学习(ICL)是当前大语言模型(LLMs)的核心能力,其中示例的选择对性能至关重要。虽然现有方法大多侧重于选择与查询最相似的示例,但示例选择多样性的影响仍未得到充分探索。我们通过一系列实验(从情感分类到更具挑战性的数学和代码问题)系统研究了多样性在上下文示例选择中的作用。基于Llama-3.1、Gemma-2和Mistral-v0.3系列模型的实验表明,考虑多样性的选择方法能提升性能(尤其在数学和代码等复杂任务上),并增强对分布外查询的鲁棒性。为支持这些发现,我们提出了一个理论框架,用以解释在上下文示例选择中引入多样性的优势。
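The abstract does not say which diversity-aware selectors were evaluated. One widely used option is Maximal Marginal Relevance (MMR), which greedily trades similarity to the query against redundancy with examples already chosen. The sketch below is a generic MMR selector over embedding vectors, offered as an illustration and not as the paper's method; the function names and the `trade_off` parameter are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mmr_select(query, candidates, k, trade_off=0.5):
    """Greedily pick k candidate indices.

    trade_off = 1.0 gives pure similarity ranking; lower values penalize
    candidates that resemble examples that were already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance - (1.0 - trade_off) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    query = [1.0, 0.2]
    candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
    print(mmr_select(query, candidates, k=2, trade_off=0.5))
    print(mmr_select(query, candidates, k=2, trade_off=1.0))
```

In this toy input the first two candidates are near duplicates, so the diversity-aware setting skips one of them in favor of the third, while pure similarity ranking keeps both.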
Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation
Abstract
arXiv:2505.19430v1 Announce Type: cross Abstract: Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force (FINancial FORward Counterfactual Evaluation). By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM-based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research.
摘要
反事实推理通常涉及对实际事件的替代性考量。尽管该方法常用于理解过去事件,但一种独特的形式——前瞻性反事实推理——专注于预测未来可能的发展态势。这种推理方式在动态变化的金融市场中具有重要价值,通过预判市场走势能有效揭示利益相关者面临的潜在风险与机遇,从而指导决策。然而由于认知负荷的限制,大规模实施此类推理具有挑战性,这凸显了对自动化解决方案的需求。虽然大型语言模型(LLMs)展现出应用潜力,但在此领域的探索仍属空白。为填补这一缺口,我们提出了创新性基准Fin-Force(金融前瞻反事实评估),通过精选金融新闻标题并提供结构化评估框架,支持基于LLM的前瞻性反事实生成。这为开发可扩展的自动化解决方案以探索和预测未来市场发展铺平了道路,从而为决策提供结构化洞见。通过在Fin-Force上的实验,我们评估了前沿LLM及反事实生成方法的性能,分析了其局限性,并为未来研究提出了建设性见解。
Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI
Abstract
arXiv:2505.19443v1 Announce Type: cross Abstract: This review presents a comprehensive analysis of two emerging paradigms in AI-assisted software development: vibe coding and agentic coding. While both leverage large language models (LLMs), they differ fundamentally in autonomy, architectural design, and the role of the developer. Vibe coding emphasizes intuitive, human-in-the-loop interaction through prompt-based, conversational workflows that support ideation, experimentation, and creative exploration. In contrast, agentic coding enables autonomous software development through goal-driven agents capable of planning, executing, testing, and iterating tasks with minimal human intervention. We propose a detailed taxonomy spanning conceptual foundations, execution models, feedback loops, safety mechanisms, debugging strategies, and real-world tool ecosystems. Through comparative workflow analysis and 20 detailed use cases, we illustrate how vibe systems thrive in early-stage prototyping and education, while agentic systems excel in enterprise-grade automation, codebase refactoring, and CI/CD integration. We further examine emerging trends in hybrid architectures, where natural language interfaces are coupled with autonomous execution pipelines. Finally, we articulate a future roadmap for agentic AI, outlining the infrastructure needed for trustworthy, explainable, and collaborative systems. Our findings suggest that successful AI software engineering will rely not on choosing one paradigm, but on harmonizing their strengths within a unified, human-centered development lifecycle.
摘要
本综述对AI辅助软件开发中的两大新兴范式——氛围编码与代理编码——进行了全面分析。尽管二者均依托大语言模型(LLMs),但在自主性、架构设计和开发者角色方面存在本质差异。氛围编码通过基于提示的对话式工作流,强调直觉性的人机协同交互,支持创意构思、实验验证和创造性探索;而代理编码则通过目标驱动的自主代理实现软件开发,这些代理能够以最小人工干预完成规划、执行、测试及迭代任务。我们提出了涵盖概念基础、执行模型、反馈机制、安全防护、调试策略及现实工具生态的详细分类体系。通过对比工作流分析和20个详细用例,阐明氛围系统在早期原型设计及教育领域表现突出,而代理系统更擅长企业级自动化、代码库重构及CI/CD集成。进一步探讨了混合架构的新兴趋势,即自然语言界面与自主执行管道的结合。最后提出了代理式AI的发展路线图,概述构建可信、可解释、可协作系统所需的基础设施。研究表明,成功的AI软件工程并非要选择单一范式,而是要在以人为本的统一开发生命周期中协调二者的优势。
VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation
Abstract
arXiv:2505.19395v1 Announce Type: cross Abstract: Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key vulnerability-handling dimensions: assessment, detection, explanation, and remediation. VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub repositories and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER. Human security experts evaluated each response according to a rigorous, standardized scoring rubric that weights remediation (quality of the code fix, 50%), explanation (20%), and classification and test plan (30%). Our results show that current state-of-the-art LLMs achieve only moderate success on VADER: OpenAI's o3 attained 54.7% accuracy overall, with others in the 49-54% range, indicating ample room for improvement. Notably, remediation quality is strongly correlated (Pearson r > 0.97) with accurate classification and test plans, suggesting that models that effectively categorize vulnerabilities also tend to fix them well. VADER's comprehensive dataset, detailed evaluation rubrics, scoring tools, and visualized results with confidence intervals are publicly released, providing the community with an interpretable, reproducible benchmark to advance vulnerability-aware LLMs. All code and data are available at: https://github.com/AfterQuery/vader
摘要
确保大语言模型(LLMs)能够有效评估、检测、解释和修复软件漏洞,对于构建健壮且安全的软件系统至关重要。我们提出了VADER,这是一个经过人工评估的基准测试,专门用于评估LLMs在漏洞处理的四个关键维度上的表现:评估、检测、解释和修复。VADER包含174个真实世界的软件漏洞,每个漏洞均从GitHub仓库中精心筛选并由安全专家标注。对于每个漏洞案例,模型需要识别缺陷、使用通用缺陷枚举(CWE)进行分类、解释其根本原因、提出修复补丁并制定测试计划。通过一次性提示策略,我们对六种最先进的LLMs(Claude 3.7 Sonnet、Gemini 2.5 Pro、GPT-4.1、GPT-4.5、Grok 3 Beta和o3)在VADER上进行了基准测试,并由安全专家根据严格的评分标准对每个回答进行评估,重点关注修复(代码修复质量,50%)、解释(20%)以及分类和测试计划(30%)。我们的结果表明,当前最先进的LLMs在VADER上仅取得中等成功——OpenAI的o3总体准确率为54.7%,其他模型在49%-54%之间,表明仍有较大改进空间。值得注意的是,修复质量与准确的分类和测试计划呈强相关性(Pearson r > 0.97),这表明能够有效分类漏洞的模型也往往能较好地修复它们。VADER的完整数据集、详细评估标准、评分工具以及带有置信区间的可视化结果已公开发布,为社区提供了一个可解释、可复现的基准,以推动漏洞感知LLMs的发展。所有代码和数据均可在以下网址获取:https://github.com/AfterQuery/vader
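The rubric weights quoted in the abstract (remediation 50%, explanation 20%, classification and test plan 30%) amount to a weighted sum. A minimal sketch of that arithmetic follows; it assumes each component has already been graded on a 0-1 scale, which the abstract does not specify, and the dictionary keys and function name are hypothetical and not taken from the VADER repository.

```python
# Weights as stated in the abstract; key names are illustrative only.
WEIGHTS = {
    "remediation": 0.5,
    "explanation": 0.2,
    "classification_and_test_plan": 0.3,
}

def overall_score(component_scores):
    """Combine per-component scores (assumed to lie in [0, 1]) into one score."""
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)

if __name__ == "__main__":
    example = {
        "remediation": 1.0,
        "explanation": 0.5,
        "classification_and_test_plan": 0.0,
    }
    print(overall_score(example))
```

Because remediation carries half the weight, a response with a perfect patch but no classification or test plan still scores at least 0.5 under this reading of the rubric.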
SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback
Abstract
arXiv:2505.19514v1 Announce Type: cross Abstract: Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.
摘要
提示词质量对大型语言模型(LLMs)的性能具有决定性影响,这促使了越来越多关于提示优化的研究。现有方法大多基于固定数据集进行提示优化,假设输入分布静态且缺乏对迭代改进的支持。我们提出SIPDO(基于数据增强优化的自改进提示框架),这是一种将合成数据生成整合至优化过程的闭环提示学习框架。SIPDO将合成数据生成器与提示优化器耦合:生成器通过暴露当前提示缺陷产生新样本,优化器则据此逐步优化提示。这种反馈驱动机制能在不依赖外部监督或新任务的前提下,实现提示性能的系统性提升。在问答和推理基准测试中的实验表明,SIPDO优于标准提示调优方法,验证了将数据合成融入提示学习工作流程的价值。
Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs
Abstract
arXiv:2505.19481v1 Announce Type: cross Abstract: Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency-quality trade-off, it remains underexplored in the context of LLM-based agents. In this work, we present the first systematic study of this trade-off in real-time decision-making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that the optimal latency-quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in StreetFighter and boosting daily yield by up to 26.52% in trading. These results underscore the need for latency-aware evaluation and deployment strategies for real-world LLM-based agents. Our benchmarks are available at Latency Sensitive Benchmarks.
摘要
大语言模型(LLMs)在多样化的推理与生成任务中展现出卓越性能,并日益作为智能体部署于代码生成和推荐系统等动态环境中。然而,高频交易和实时竞技游戏等现实应用场景需要严格延迟约束下的决策能力,其中更快的响应速度直接转化为更高收益。尽管这种延迟与质量的权衡至关重要,但在基于LLM的智能体研究中仍未被充分探索。本研究首次对实时决策任务中的这种权衡进行了系统性分析。为支持研究,我们引入两个新基准:HFTBench(高频交易模拟器)和StreetFighter(竞技游戏平台)。分析表明,最优延迟-质量平衡因任务而异,而牺牲质量换取更低延迟能显著提升下游性能。为此,我们提出FPX框架——通过实时需求动态选择模型规模和量化级别的自适应系统。该方法在两个基准测试中均取得最佳表现:在《街头霸王》中获胜率最高提升80%,在交易场景中日收益率最高提升26.52%,这凸显了基于LLM的智能体需要延迟感知的评估与部署策略。研究结果证实了延迟敏感评估框架对现实世界LLM智能体的关键价值。相关基准测试已发布于Latency Sensitive Benchmarks平台。
CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation
Abstract
arXiv:2505.19502v1 Announce Type: cross Abstract: Trustworthy evaluation methods for code snippets play a crucial role in neural code generation. Traditional methods, which either rely on reference solutions or require executable test cases, have inherent limitations in flexibility and scalability. The recent LLM-as-Judge methodology offers a promising alternative by directly evaluating functional consistency between the problem description and the generated code. To systematically understand the landscape of these LLM-as-Judge methods, we conduct a comprehensive empirical study across three diverse datasets. Our investigation reveals the pros and cons of two categories of LLM-as-Judge methods: the methods based on general foundation models can achieve good performance but require complex prompts and lack explainability, while the methods based on reasoning foundation models provide better explainability with simpler prompts but demand substantial computational resources due to their large parameter sizes. To address these limitations, we propose CODE-DITING, a novel code evaluation method that balances accuracy, efficiency and explainability. We develop a data distillation framework that effectively transfers reasoning capabilities from DeepSeek-R1 671B to our CODE-DITING 1.5B and 7B models, significantly enhancing evaluation explainability and reducing the computational cost. With the majority vote strategy in the inference process, CODE-DITING 1.5B outperforms all models of the same parameter magnitude and achieves performance normally seen only in models with five times as many parameters. CODE-DITING 7B surpasses GPT-4o and DeepSeek-V3 671B, even though it uses only 1% of the parameter volume of these large models. Further experiments show that CODE-DITING is robust to preference leakage and can serve as a promising alternative for code evaluation.
摘要
代码片段的可信评估方法在神经代码生成中起着关键作用。传统方法要么依赖参考解决方案,要么需要可执行测试用例,在灵活性和可扩展性方面存在固有局限。新兴的LLM-as-Judge方法通过直接评估问题描述与生成代码之间的功能一致性,提供了有前景的替代方案。为系统理解这类方法的现状,我们在三个不同数据集上开展了全面实证研究。研究发现两类LLM-as-Judge方法的优缺点:基于通用基础模型的方法虽能取得良好性能,但需要复杂提示且缺乏可解释性;而基于推理基础模型的方法通过简单提示即可提供更好可解释性,但由于参数量庞大需要大量计算资源。针对这些局限,我们提出CODE-DITING这一新型代码评估方法,在准确性、效率和可解释性之间实现平衡。我们开发的数据蒸馏框架有效将DeepSeek-R1671B的推理能力迁移至CODE-DITING 1.5B和7B模型,显著提升评估可解释性并降低计算成本。通过推理过程中的多数投票策略,CODE-DITING 1.5B在同等参数规模模型中表现最优,达到通常需要5倍参数规模才能实现的性能。CODE-DITING 7B虽仅使用这些大模型1%的参数体量,却超越了GPT-4o和DeepSeek-V3 671B。进一步实验表明CODEDITING对偏好泄露具有鲁棒性,可作为代码评估的理想替代方案。
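The majority vote strategy mentioned in the abstract can be sketched as follows: the judge model is sampled several times on the same problem and code, and the most frequent verdict wins. The label strings and the tie rule (first label seen among the tied ones) are assumptions made for this illustration; the abstract does not describe how ties are handled.

```python
from collections import Counter

def majority_vote(verdicts):
    """Return the most frequent verdict from repeated judge samples.

    Ties resolve to the tied label that appeared first; using an odd
    number of samples avoids ties for binary verdicts.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)  # preserves first-seen order of labels
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0]

if __name__ == "__main__":
    samples = ["correct", "incorrect", "correct", "correct", "incorrect"]
    print(majority_vote(samples))
```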
DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
Abstract
arXiv:2505.19504v1 Announce Type: cross Abstract: Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while the teacher model's original performance is preserved or even improved, student models distilled from the defensively generated teacher outputs suffer catastrophically reduced performance, confirming our method's effectiveness as a practical safeguard against KD-based model imitation.
摘要
大型语言模型(LLMs)作为重大智力与经济投入的成果,其高效性可能无意中通过知识蒸馏(KD)促进模型模仿。在实际场景中,竞争对手仅需观察公开可获取的输出即可蒸馏专有LLM的能力,这类似于仅通过观察来逆向工程复杂表演。现有保护方法(如数字水印)仅能事后识别模仿行为,而其他防御措施则假设学生模型会复制教师模型的内部逻辑值,导致这些方法对纯基于输出文本的蒸馏完全无效。本文针对基于API访问的现实约束条件下主动保护LLMs的挑战,提出了一种高效防御性输出生成(DOGe)策略。该策略通过微妙调整LLM的输出行为,在保证合法用户获得准确有用结果的同时,使输出内容对蒸馏过程具有误导性,从而显著破坏模仿尝试。我们仅通过对抗性损失微调教师LLM的最终线性层实现这一目标,这种针对性训练方法能在推理阶段预判并干扰蒸馏尝试。实验表明:在保持甚至提升教师模型原始性能的同时,从防御性生成的教师输出中蒸馏得到的学生模型性能出现灾难性下降,这证明我们的方法能有效防范基于KD的模型模仿。
Hierarchical Tree Search-based User Lifelong Behavior Modeling on Large Language Model
Abstract
arXiv:2505.19505v1 Announce Type: cross Abstract: Large Language Models (LLMs) have garnered significant attention in Recommendation Systems (RS) due to their extensive world knowledge and robust reasoning capabilities. However, a critical challenge lies in enabling LLMs to effectively comprehend and extract insights from massive user behaviors. Current approaches that directly leverage LLMs for user interest learning face limitations in handling long sequential behaviors, effectively extracting interest, and applying interest in practical scenarios. To address these issues, we propose a Hierarchical Tree Search-based User Lifelong Behavior Modeling framework (HiT-LBM). HiT-LBM integrates Chunked User Behavior Extraction (CUBE) and Hierarchical Tree Search for Interest (HTS) to capture the diverse interests and interest evolution of users. CUBE divides user lifelong behaviors into multiple chunks and learns the interest and interest evolution within each chunk in a cascading manner. HTS generates candidate interests through hierarchical expansion and searches for the optimal interest with a process rating model to ensure information gain for each behavior chunk. Additionally, we design Temporal-Ware Interest Fusion (TIF) to integrate interests from multiple behavior chunks, constructing a comprehensive representation of user lifelong interests. The representation can be embedded into any recommendation model to enhance performance. Extensive experiments demonstrate the effectiveness of our approach, showing that it surpasses state-of-the-art methods.
摘要
大型语言模型(LLMs)凭借其丰富的世界知识和强大的推理能力,在推荐系统(RS)领域获得了广泛关注。然而,如何使LLMs有效理解并提取海量用户行为中的洞察仍面临关键挑战。现有直接利用LLMs进行用户兴趣学习的方法在处理长序列行为、有效提取兴趣及实际场景应用方面存在局限。为此,我们提出基于层次化树搜索的用户终身行为建模框架(HiT-LBM)。该框架通过分块用户行为提取(CUBE)和层次化兴趣树搜索(HTS)来捕捉用户多样化兴趣及其演化过程。CUBE将用户终身行为划分为多个区块,以级联方式学习每个区块内的兴趣及兴趣演化。HTS通过层次化扩展生成候选兴趣,并利用过程评分模型搜索最优兴趣,确保每个行为区块的信息增益。此外,我们设计时序感知兴趣融合模块(TIF)来整合多行为区块的兴趣,构建用户终身兴趣的完整表征。该表征可嵌入任意推荐模型以提升性能。大量实验证明本方法的有效性,其表现优于当前最先进方法。
Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models
Abstract
arXiv:2505.19509v1 Announce Type: cross Abstract: Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks where the contextual information from external sources may contradict the model's internal parametric knowledge, leading to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most focus solely on intra-memory conflicts, while context-memory and inter-context conflicts remain largely uninvestigated. Furthermore, commonly used factual knowledge-based evaluations are often overlooked, and existing datasets lack a thorough investigation into conflict detection capabilities. To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench encompasses three types of multimodal knowledge conflicts and includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification. We evaluate three representative series of LMMs on both model behavior analysis and conflict detection tasks. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research in multimodal knowledge conflict and enhance the development of multimodal RAG systems. The source code is available at https://github.com/MLLMKCBENCH/MLLMKC.
摘要
大型多模态模型(LMMs)在面临多模态知识冲突时存在显著挑战,特别是在检索增强生成(RAG)框架下,外部来源的上下文信息可能与模型内部参数化知识相矛盾,导致输出结果不可靠。然而现有基准测试未能反映此类现实冲突场景:多数研究仅关注内部记忆冲突,而上下文-记忆冲突与跨上下文冲突领域仍缺乏深入探究。此外,基于事实知识的评估方法常被忽视,现有数据集对冲突检测能力的考察也不够全面。为填补这一空白,我们提出MMKC-Bench基准测试,专门用于评估上下文-记忆和跨上下文场景中的事实知识冲突。该基准涵盖三类多模态知识冲突,包含通过自动化流程采集并经人工校验的1,573个知识实例和3,381张图像,涉及23个广泛类别。我们对三个代表性LMM系列进行了模型行为分析和冲突检测任务评估。研究发现,尽管当前LMMs能够识别知识冲突,但往往更倾向于依赖内部参数化知识而非外部证据。期望MMKC-Bench能促进多模态知识冲突研究的深入,并推动多模态RAG系统的发展。
DocMEdit: Towards Document-Level Model Editing
Abstract
arXiv:2505.19572v1 Announce Type: cross Abstract: Model editing aims to correct errors and outdated knowledge in large language models (LLMs) with minimal cost. Prior research has proposed a variety of datasets to assess the effectiveness of these model editing methods. However, most existing datasets only require models to output short phrases or sentences, overlooking the widespread existence of document-level tasks in the real world and raising doubts about their practical usability. Aiming to address this limitation and promote the application of model editing in real-world scenarios, we propose the task of document-level model editing. To tackle such challenges and enhance model capabilities in practical settings, we introduce DocMEdit, a dataset focused on document-level model editing, characterized by document-level inputs and outputs, extrapolation, and multiple facts within a single edit. We propose a series of evaluation metrics and experiments. The results show that the difficulties in document-level model editing pose challenges for existing model editing methods.
摘要
模型编辑旨在以最小成本修正大语言模型(LLMs)中的错误和过时知识。先前研究提出了多种数据集以评估这些模型编辑方法的有效性。然而,现有数据集大多仅要求模型输出短短语或句子,忽视了现实世界中广泛存在的文档级任务,这引发了对其实际适用性的质疑。为解决这一局限并推动模型编辑在现实场景中的应用,我们提出了文档级模型编辑任务。为应对此类挑战并增强模型在实际环境中的能力,我们引入了\benchmarkname数据集,该数据集专注于文档级模型编辑,其特点包括文档级输入输出、外推性以及单次编辑中包含多重事实。我们提出了一系列评估指标和实验,结果表明文档级模型编辑的难度对现有模型编辑方法构成了挑战。
How Syntax Specialization Emerges in Language Models
Abstract
arXiv:2505.19548v1 Announce Type: cross Abstract: Large language models (LLMs) have been found to develop surprising internal specializations: Individual neurons, attention heads, and circuits become selectively sensitive to syntactic structure, reflecting patterns observed in the human brain. While this specialization is well-documented, how it emerges during training and what influences its development remains largely unknown. In this work, we tap into the black box of specialization by tracking its formation over time. By quantifying internal syntactic consistency across minimal pairs from various syntactic phenomena, we identify a clear developmental trajectory: Syntactic sensitivity emerges gradually, concentrates in specific layers, and exhibits a 'critical period' of rapid internal specialization. This process is consistent across architectures and initialization parameters (e.g., random seeds), and is influenced by model scale and training data. We therefore reveal not only where syntax arises in LLMs but also how some models internalize it during training. To support future research, we will release the code, models, and training checkpoints upon acceptance.
摘要
研究发现大型语言模型(LLMs)会形成令人惊奇的内部专化现象:单个神经元、注意力头和电路会选择性对句法结构产生敏感反应,这种现象与人类大脑中观察到的模式相呼应。尽管这种专化已有充分记录,但其在训练过程中如何形成以及受哪些因素影响仍属未知领域。 本研究通过追踪专化现象的时序形成过程,揭示了其黑箱机制。通过量化不同句法现象最小对比对中的内部句法一致性,我们发现了明确的发展轨迹:句法敏感性逐步显现,集中分布于特定层级,并呈现出一个快速内部专化的"关键期"。该过程在不同架构和初始化参数(如随机种子)中表现一致,同时受模型规模与训练数据影响。因此,我们不仅揭示了句法在LLMs中的形成位置,还阐明了部分模型在训练过程中内化句法的机制。为支持后续研究,我们将在论文录用后公开相关代码、模型及训练检查点。
Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing
Abstract
arXiv:2505.19578v1 Announce Type: cross Abstract: Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. While existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, they often fail to fully capture the true dynamics of attention, resulting in reduced efficiency and compromised accuracy. Instead, we propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, enabling a more realistic capture of the dynamic behavior of attention. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By strategically sharing computed accurate patterns across attention heads, our method effectively captures actual patterns while requiring full attention computation for only a small subset of heads. Comprehensive evaluations demonstrate that our approach achieves superior or comparable speedup relative to state-of-the-art methods while delivering the best overall accuracy.
摘要
稀疏注意力方法利用注意力机制固有的稀疏性来加速长上下文推理的预填充阶段,从而缓解全注意力计算的二次复杂度问题。现有稀疏注意力方法依赖预定义模式或不精确估计来近似注意力行为,往往无法完整捕捉注意力的真实动态,导致效率降低和准确性受损。我们提出了一种高精度稀疏注意力机制,通过在注意力头间共享相似但精确的注意力模式,更真实地捕捉注意力的动态行为。该方法基于两个关键发现:(1) 注意力模式表现出强烈的头间相似性;(2) 这种相似性在不同输入间保持高度一致。通过策略性地在注意力头间共享计算得到的精确模式,我们的方法能有效捕获实际模式,同时仅需对少量头进行全注意力计算。综合评估表明,相较于最先进方法,本方案在取得相当或更优加速比的同时,提供了最佳的整体准确性。
Multi-Agent Collaboration via Evolving Orchestration
Abstract
arXiv:2505.19591v1 Announce Type: cross Abstract: Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator ("puppeteer") dynamically directs agents ("puppets") in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator's evolution.
摘要
大语言模型(LLMs)在各类下游任务中取得了显著成果,但其单一性限制了复杂问题解决中的可扩展性和效率。尽管近期研究探索了LLMs间的多智能体协作,但多数方法依赖于静态组织结构,难以随任务复杂度和智能体数量增长而自适应调整,导致协调开销与效率低下。为此,我们提出一种基于LLM的木偶式多智能体协作范式,其中中央协调器("操纵者")能根据动态任务状态实时调度智能体("木偶")。该协调器通过强化学习训练,可自适应地排序和优先调用智能体,实现灵活可进化的集体推理。在封闭域和开放域场景中的实验表明,该方法能以更低计算成本获得更优性能。分析进一步揭示,关键改进始终源于协调器演化过程中涌现出的更紧凑、循环式推理结构。
FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models
Abstract
arXiv:2505.19536v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut
摘要
大型视觉语言模型(LVLMs)在多模态理解方面表现卓越,但由于冗余的视觉标记导致计算成本高昂。现有的剪枝方法通常依赖单层注意力分数来排序和剪枝冗余视觉标记以解决这一效率问题。然而,由于标记与层之间的交互复杂,这引发了一个基本问题:如此简单的单层标准是否足以识别冗余?为回答这一问题,我们从信息流这一基础视角重新思考冗余视觉标记的产生:信息流通过捕捉标记在层间的信息传递方式,建模了标记与层之间的交互。我们发现(1)CLS标记作为信息中继,可简化复杂的信息流分析;(2)冗余通过逐层注意力集中而动态渐进地显现;(3)仅依赖单层注意力分数可能导致矛盾的冗余识别。基于此,我们提出FlowCut,一种信息流感知的剪枝框架,缓解当前标准在识别冗余标记上的不足,并更好地与模型的固有行为对齐。大量实验表明,FlowCut取得了优异的结果,在LLaVA-1.5-7B上以88.9%的标记削减率优于现有最佳方法1.6%,在LLaVA-NeXT-7B上以94.4%的削减率领先4.3%,并在预填充阶段实现了3.2倍的加速。我们的代码发布于https://github.com/TungChintao/FlowCut。
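A minimal sketch of the single-layer criterion the abstract questions: rank visual tokens by one layer's attention scores and keep the top k. The attention values and the helper name `keep_top_k` are made up for illustration. This is the baseline being critiqued, not FlowCut's own algorithm. It shows how two layers can disagree on which tokens are redundant.

```python
def keep_top_k(scores, k):
    """Return the sorted indices of the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical attention received by five visual tokens at two different layers.
layer_a = [0.40, 0.05, 0.30, 0.10, 0.15]
layer_b = [0.05, 0.35, 0.10, 0.30, 0.20]

kept_a = keep_top_k(layer_a, 2)
kept_b = keep_top_k(layer_b, 2)

print("kept by layer A:", kept_a)
print("kept by layer B:", kept_b)
# An empty overlap means the two single-layer criteria contradict each other.
print("overlap:", sorted(set(kept_a) & set(kept_b)))
```

With these toy scores the two layers keep disjoint token sets, which is the kind of contradiction that motivates a criterion spanning several layers.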
Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
Abstract
arXiv:2505.19599v1 Announce Type: cross Abstract: Typical methods for evaluating the performance of language models evaluate their ability to answer questions accurately. These evaluation metrics are acceptable for determining the extent to which language models can understand and reason about text in a general sense, but fail to capture nuanced capabilities, such as the ability of language models to recognize and obey rare grammar points, particularly in languages other than English. We measure the perplexity of language models when confronted with the "first person psych predicate restriction" grammar point in Japanese. Weblab is the only tested open source model in the 7-10B parameter range which consistently assigns higher perplexity to ungrammatical psych predicate sentences than grammatical ones. We give evidence that Weblab's uniformly bad tokenization is a possible root cause for its good performance, and show that Llama 3's perplexity on grammatical psych predicate sentences can be reduced by orders of magnitude (28x difference) by restricting test sentences to those with uniformly well-behaved tokenizations. We show in further experiments on machine translation tasks that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.
摘要
评估语言模型性能的典型方法主要考察其准确回答问题的能力。这类评估指标虽能总体衡量语言模型对文本的理解与推理水平,却难以捕捉细微能力差异,例如模型对罕见语法点(尤其是非英语语言)的识别与遵循能力。本研究通过测量语言模型面对日语"第一人称心理谓词限制"语法点时的困惑度,发现Weblab是7-10B参数范围内唯一始终对不合语法心理谓词句赋予更高困惑度的开源模型。证据表明,Weblab表现优异可能源于其一致较差的分词;实验还表明,将测试句限定为分词一致良好的句子后,Llama 3对合语法心理谓词句的困惑度可大幅降低(相差28倍)。在机器翻译任务的进一步实验中,我们发现当分词问题阻碍最自然句式输出时,语言模型会转而采用替代语法模式以生成合语法句子。
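For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is computed from per-token log-probabilities. The probabilities below are invented for illustration and do not come from the paper. In practice they would be read from a language model's output for each sentence.

```python
import math

def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token probabilities for a grammatical / ungrammatical pair.
grammatical = [math.log(p) for p in (0.5, 0.25, 0.5)]
ungrammatical = [math.log(p) for p in (0.5, 0.01, 0.5)]

print(perplexity(grammatical))
print(perplexity(ungrammatical))
# A model that respects the grammar point should find the second sentence more surprising.
print(perplexity(ungrammatical) > perplexity(grammatical))
```

Because the score is averaged per token, a sentence split into many unlikely fragments can look surprising even when it is grammatical, which is one way tokenization can distort a comparison like this.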
Decoupling Spatio-Temporal Prediction: When Lightweight Large Models Meet Adaptive Hypergraphs
Abstract
arXiv:2505.19620v1 Announce Type: cross Abstract: Spatio-temporal prediction is a pivotal task with broad applications in traffic management, climate monitoring, energy scheduling, etc. However, existing methodologies often struggle to balance model expressiveness and computational efficiency, especially when scaling to large real-world datasets. To tackle these challenges, we propose STH-SepNet (Spatio-Temporal Hypergraph Separation Networks), a novel framework that decouples temporal and spatial modeling to enhance both efficiency and precision. Therein, the temporal dimension is modeled using lightweight large language models, which effectively capture low-rank temporal dynamics. Concurrently, the spatial dimension is addressed through an adaptive hypergraph neural network, which dynamically constructs hyperedges to model intricate, higher-order interactions. A carefully designed gating mechanism is integrated to seamlessly fuse temporal and spatial representations. By leveraging the fundamental principles of low-rank temporal dynamics and spatial interactions, STH-SepNet offers a pragmatic and scalable solution for spatio-temporal prediction in real-world applications. Extensive experiments on large-scale real-world datasets across multiple benchmarks demonstrate the effectiveness of STH-SepNet in boosting predictive performance while maintaining computational efficiency. This work may provide a promising lightweight framework for spatio-temporal prediction, aiming to reduce computational demands while enhancing predictive performance. Our code is available at https://github.com/SEU-WENJIA/ST-SepNet-Lightweight-LLMs-Meet-Adaptive-Hypergraphs.
摘要
时空预测是交通管理、气候监测、能源调度等领域的关键任务。然而现有方法在模型表达能力与计算效率的平衡上存在不足,尤其难以适应大规模现实数据集。为此,我们提出STH-SepNet(时空超图分离网络),通过解耦时空建模来提升效率与精度。该框架采用轻量级大语言模型捕捉低秩时间动态,同时通过自适应超图神经网络动态构建超边以建模复杂高阶空间交互,并设计门控机制实现时空表征的有机融合。基于低秩时间动态与空间交互的基本原理,STH-SepNet为实际应用提供了高效可扩展的时空预测方案。在多基准的大规模现实数据集实验中,该方法在保持计算效率的同时显著提升了预测性能。本研究为时空预测提供了一个有望降低计算成本并提升预测性能的轻量级框架。代码已开源:https://github.com/SEU-WENJIA/ST-SepNet-Lightweight-LLMs-Meet-Adaptive-Hypergraphs。
Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling
Abstract
arXiv:2505.19609v1 Announce Type: cross Abstract: Long-context supervised fine-tuning (Long-SFT) plays a vital role in enhancing the performance of large language models (LLMs) on long-context tasks. To smoothly adapt LLMs to long-context scenarios, this process typically entails training on mixed datasets containing both long and short sequences. However, this heterogeneous sequence length distribution poses significant challenges for existing training systems, as they fail to simultaneously achieve high training efficiency for both long and short sequences, resulting in sub-optimal end-to-end system performance in Long-SFT. In this paper, we present a novel perspective on data scheduling to address the challenges posed by the heterogeneous data distributions in Long-SFT. We propose Skrull, a dynamic data scheduler specifically designed for efficient Long-SFT. Through dynamic data scheduling, Skrull balances the computation requirements of long and short sequences, improving overall training efficiency. Furthermore, we formulate the scheduling process as a joint optimization problem and thoroughly analyze the trade-offs involved. Based on this analysis, Skrull employs a lightweight scheduling algorithm to achieve near-zero cost online scheduling in Long-SFT. Finally, we implement Skrull upon DeepSpeed, a state-of-the-art distributed training system for LLMs. Experimental results demonstrate that Skrull outperforms DeepSpeed by 3.76x on average (up to 7.54x) in real-world Long-SFT scenarios.
摘要
长上下文监督微调(Long-SFT)对于提升大语言模型(LLM)在长上下文任务中的表现至关重要。为使LLM顺利适应长上下文场景,该过程通常需要在包含长短序列的混合数据集上进行训练。然而,这种异构序列长度分布对现有训练系统提出了重大挑战,因为它们无法同时实现长短序列的高训练效率,导致Long-SFT的端到端系统性能欠佳。本文提出一种新颖的数据调度视角,以解决Long-SFT中异构数据分布带来的挑战。我们设计了Skrull——一个专为高效长上下文微调而设计的动态数据调度器。通过动态数据调度,Skrull平衡了长短序列的计算需求,从而提升整体训练效率。此外,我们将调度过程建模为联合优化问题,并深入分析其中的权衡关系。基于这些分析,Skrull采用轻量级调度算法,在Long-SFT中实现近乎零成本的在线调度。最后,我们在最先进的LLM分布式训练系统DeepSpeed上实现了Skrull。实验结果表明,在实际长上下文微调场景中,Skrull平均性能超越DeepSpeed 3.76倍(最高达7.54倍)。
Preference Optimization by Estimating the Ratio of the Data Distribution
Abstract
arXiv:2505.19601v1 Announce Type: cross Abstract: Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the target policy from a likelihood ratio estimation perspective. The ratio of the target policy provides a unique identification of the policy distribution without relying on reward models or partition functions. This allows the generalized loss to retain both simplicity and theoretical guarantees, which prior work such as f-PO fails to achieve simultaneously. We propose Bregman preference optimization (BPO), a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. BPO subsumes DPO as a special case and offers tractable forms for all instances, allowing implementation with a few lines of code. We further develop scaled Basu's power divergence (SBA), a gradient scaling method that can be used for BPO instances. The BPO framework complements other DPO variants and is applicable to target policies defined by these variants. In experiments, unlike other probabilistic loss extensions such as f-DPO or f-PO, which exhibit a trade-off between generation fidelity and diversity, instances of BPO improve both win rate and entropy compared with DPO. When applied to Llama-3-Instruct-8B, BPO achieves state-of-the-art performance among Llama-3-8B backbones, with a 55.9% length-controlled win rate on AlpacaEval2.
摘要
直接偏好优化(DPO)作为一种简单稳定的方法,被广泛用于将大语言模型(LLM)与人类偏好对齐。本文从似然比估计的角度出发,研究了一种广义DPO损失函数,使策略模型能够匹配目标策略。目标策略的比率提供了策略分布的唯一标识,无需依赖奖励模型或配分函数。这使得广义损失既能保持简洁性,又具有理论保证,而先前工作如f-PO无法同时实现这两点。我们提出Bregman偏好优化(BPO),这是一个用于比率匹配的广义框架,提供了一组实现目标策略最优性的目标函数族。BPO将DPO作为特例包含其中,并为所有实例提供了易处理的形式,仅需几行代码即可实现。我们进一步开发了缩放Basu幂散度(SBA),这是一种可用于BPO实例的梯度缩放方法。BPO框架与其他DPO变体互补,并适用于由这些变体定义的目标策略。实验表明,与f-DPO或f-PO等概率损失扩展不同(这些方法在生成保真度与多样性之间存在权衡),BPO实例在胜率和熵两方面均优于DPO。当应用于Llama-3-Instruct-8B时,BPO在Llama-3-8B骨干模型中实现了最先进的性能,在AlpacaEval2上达到55.9%的长度控制胜率。
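As background for the special case that BPO is said to subsume, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the BPO or SBA objective, which the abstract does not spell out. The function name, argument names, and the default `beta` are illustrative assumptions. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy favors the chosen response more than the reference does.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
# When policy and reference agree exactly, the loss is log(2).
print(dpo_loss(-12.0, -12.0, -12.0, -12.0))
```

The loss depends only on log-ratios between the policy and the reference, which is why no explicit reward model or partition function appears in it.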
Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models
Abstract
arXiv:2505.19616v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks, yet they often exhibit difficulty in distinguishing task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA), which can lead to susceptibility to misleading or spurious inputs. We refer to this broader limitation as the Cross-Modality Competency Problem: the model's inability to fairly evaluate all modalities. This vulnerability becomes more evident in modality-specific tasks such as image classification or pure text question answering, where models are expected to rely solely on one modality. In such tasks, spurious information from irrelevant modalities often leads to significant performance degradation. We refer to this failure as Modality Interference, which serves as a concrete and measurable instance of the cross-modality competency problem. We further design a perturbation-based causal diagnostic experiment to verify and quantify this problem. To mitigate modality interference, we propose a novel framework to fine-tune MLLMs, including perturbation-based data augmentations with both heuristic perturbations and adversarial perturbations via Projected Gradient Descent (PGD), and a consistency regularization strategy applied to model outputs with original and perturbed inputs. Experiments on multiple benchmark datasets (image-heavy, text-heavy, and VQA tasks) and multiple model families with different scales demonstrate significant improvements in robustness and cross-modality competency, indicating our method's effectiveness in boosting unimodal reasoning ability while enhancing performance on multimodal tasks.
摘要
多模态大语言模型(MLLMs)已在各类任务中展现出卓越能力,但其常难以区分任务相关与无关信号,尤其在视觉问答(VQA)等任务中易受误导性或伪相关输入的干扰。我们将这一广义局限称为跨模态能力问题:模型无法公平评估所有模态。该缺陷在图像分类或纯文本问答等单模态任务中更为显著——此类任务本需模型仅依赖单一模态,而无关模态的干扰信息常导致性能显著下降。我们将此失效现象定义为模态干扰,它是跨模态能力问题的一个具体且可量化的实例。我们进一步设计了基于扰动的因果诊断实验以验证和量化该问题。为缓解模态干扰,我们提出一种新的MLLMs微调框架:包括基于扰动的数据增强(既有启发式扰动,也有通过投影梯度下降(PGD)生成的对抗扰动),以及对原始输入与扰动输入的模型输出施加的一致性正则化策略。在多个基准数据集(图像主导型、文本主导型及VQA任务)和不同规模模型族上的实验表明,该方法能显著提升模型鲁棒性与跨模态能力,说明其在增强单模态推理能力的同时,也能提升多模态任务性能。
AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems
Abstract
arXiv:2505.19623v1 Announce Type: cross Abstract: The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs' advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing and studying agentic recommender systems; and (3) the first comprehensive benchmark comparing 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a continuously maintained leaderboard (https://tsinghua-fib-lab.github.io/AgentSocietyChallenge/pages/overview.html), fostering ongoing community engagement and reproducible research. The benchmark is available at: https://huggingface.co/datasets/SGJQovo/AgentRecBench.
摘要
基于大语言模型(LLM)的智能推荐系统标志着个性化推荐领域的范式转变,其通过LLM的高级推理与角色扮演能力实现自主、自适应的决策机制。与传统推荐方法不同,智能推荐系统能够动态收集并解析复杂环境中的用户-项目交互数据,生成具有跨场景泛化能力的鲁棒推荐策略。然而,该领域目前缺乏系统评估这些方法的标准化协议。为填补这一关键空白,我们提出:(1)集成丰富用户与项目元数据的交互式文本推荐模拟器,包含三种典型评估场景(经典推荐任务、兴趣演化任务和冷启动任务);(2)用于开发和研究智能推荐系统的统一模块化框架;(3)首个全面对比10种经典方法与智能推荐方法的基准测试。研究结果验证了智能系统的优越性,并为其核心组件制定了可操作的设计准则。该基准环境已通过公开挑战赛严格验证,并保持公开可访问的持续维护排行榜,以促进学界持续参与和可重复研究。基准测试地址详见:https://huggingface.co/datasets/SGJQovo/AgentRecBench。
Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models
Abstract
arXiv:2505.19631v1 Announce Type: cross Abstract: Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic n-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at https://github.com/hkr04/LLACA
摘要
分词是自然语言处理(NLP)的基石任务。基于"先理解,后切分"的理念,我们提出一个新框架来探索大语言模型(LLMs)在无监督分词任务中的性能极限,并通过分词任务评估LLMs的语义理解能力。我们采用当前主流LLMs在多种语言上进行分词实验以评估其"理解"能力。研究发现,LLMs能够遵循简单指令将原始文本切分为词语,且存在参数量越大的模型在多语言任务中表现越优的趋势。此外,我们提出了一种创新的无监督方法LLACA(大语言模型启发的Aho-Corasick自动机),该方法巧妙结合了Aho-Corasick自动机的高效模式识别能力和预训练LLMs的深层语义理解优势。LLACA不仅能构建基于上下文动态调整的n-gram模型,还融合了LLMs的细粒度语义理解,相较传统方法实现了显著提升。项目源代码已开源:https://github.com/hkr04/LLACA
Large Language Models in Code Co-generation for Safe Autonomous Vehicles
Abstract
arXiv:2505.19658v1 Announce Type: cross Abstract: Software engineers in various industrial domains are already using Large Language Models (LLMs) to accelerate the process of implementing parts of software systems. When considering its potential use for ADAS or AD systems in the automotive context, there is a need to systematically assess this new setup: LLMs entail a well-documented set of risks for safety-related systems' development due to their stochastic nature. To reduce the effort for code reviewers to evaluate LLM-generated code, we propose an evaluation pipeline to conduct sanity-checks on the generated code. We compare the performance of six state-of-the-art LLMs (CodeLlama, CodeGemma, DeepSeek-r1, DeepSeek-Coders, Mistral, and GPT-4) on four safety-related programming tasks. Additionally, we qualitatively analyse the most frequent faults generated by these LLMs, creating a failure-mode catalogue to support human reviewers. Finally, the limitations and capabilities of LLMs in code generation, and the use of the proposed pipeline in the existing process, are discussed.
摘要
各工业领域的软件工程师已开始使用大语言模型(LLM)来加速软件系统部分模块的实现过程。在考虑将其应用于汽车领域的ADAS或AD系统时,需要系统评估这一新方案:由于LLM的随机性特性,其在安全相关系统开发中存在一系列明确记录的风险。为降低代码审查人员评估LLM生成代码的工作量,我们提出了一种用于执行生成代码完整性检查的评估流程。本研究对比了六种前沿LLM(CodeLlama、CodeGemma、DeepSeek-r1、DeepSeek-Coders、Mistral和GPT-4)在四项安全相关编程任务中的表现。此外,我们通过定性分析这些LLM生成的最常见错误,建立了故障模式分类目录以支持人工审查。最后,本文探讨了LLM在代码生成方面的局限性与能力,以及所提评估流程在现有开发过程中的应用价值。
Automated evaluation of children's speech fluency for low-resource languages
Abstract
arXiv:2505.19671v1 Announce Type: cross Abstract: Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low-resource languages. This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ASR model, an objective metrics extraction stage, and a generative pre-trained transformer (GPT) network. The objective metrics include phonetic and word error rates, speech rate, and speech-pause duration ratio. These are interpreted by a GPT-based classifier, guided by a small set of human-evaluated ground truth examples, to score fluency. We evaluate the proposed system on a dataset of children's speech in two low-resource languages, Tamil and Malay, and compare the classification performance against Random Forest and XGBoost, as well as against using ChatGPT-4o to predict fluency directly from speech input. Results demonstrate that the proposed approach achieves significantly higher accuracy than multimodal GPT or other methods.
摘要
在教育领域,针对主流语言的儿童口语流畅度评估已有深入研究,但在资源匮乏语言中仍面临巨大挑战。本文提出一种自动评估系统,通过结合微调的多语言自动语音识别(ASR)模型、客观指标提取阶段以及生成式预训练变换器(GPT)网络来实现流畅度评估。客观指标包括音素错误率、单词错误率、语速及语音停顿时长比。这些指标由基于GPT的分类器进行解释,该分类器通过少量人工评估的真实样例进行引导,最终输出流畅度评分。我们在两种低资源语言(泰米尔语和马来语)的儿童语音数据集上评估了所提系统,并将分类性能与随机森林、XGBoost以及直接使用ChatGPT-4o从语音输入预测流畅度的方法进行对比。结果表明,所提方法比多模态GPT或其他方法实现了显著更高的准确率。
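The objective metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's extraction stage. The word error rate is the usual word-level edit distance divided by the reference length. The speech rate (words per second) and the speech-to-pause ratio (speaking time divided by pause time) are assumed definitions, since the abstract does not give formulas. All function names and sample values are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def speech_rate(num_words, speaking_seconds):
    """Words per second of speaking time (assumed definition)."""
    return num_words / speaking_seconds

def speech_pause_ratio(speaking_seconds, pause_seconds):
    """Speaking time divided by pause time (assumed definition)."""
    return speaking_seconds / pause_seconds

# Hypothetical read-aloud sample: reference text versus ASR transcript.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
print(speech_rate(5, 4.0))
print(speech_pause_ratio(4.0, 2.0))
```

A phonetic error rate can be computed with the same edit-distance routine applied to phoneme sequences instead of words.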
MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE
Abstract
arXiv:2505.19645v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser -- the prevailing trend in MoE designs -- the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand the tradeoffs involved in SD, we develop a reliable model based on theoretical analyses. While current SD research primarily focuses on improving acceptance rates of algorithms, changes in workload and model architecture can still lead to degraded SD acceleration even with high acceptance rates. To address this limitation, we introduce a new metric 'target efficiency' that characterizes these effects, thus helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective to speed up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.
摘要
大型语言模型(LLMs)已在众多应用中取得显著成功,其中混合专家(MoE)模型展现出巨大潜力。与传统密集模型相比,MoE能以更少计算量实现更优性能。推测解码(SD)作为无需牺牲精度的LLM推理加速技术被广泛采用,但此前仅被认为对密集模型有效。本研究首先揭示:在中等批量大小下,MoE从SD中获得的加速效益意外优于密集模型;且随着MoE稀疏化程度提升(当前设计的主流趋势),SD加速有效的批量大小范围将进一步扩大。为量化分析SD的权衡机制,我们基于理论分析建立了可靠建模框架。现有SD研究主要聚焦算法接受率的提升,但工作负载与模型架构的变化仍可能导致高接受率下的SD加速效果下降。针对这一局限,我们提出"目标效率"新指标来表征这些影响,帮助研究者系统定位瓶颈并全面理解SD加速机制。对于私有服务等现有解决方案乏力的场景,本研究为加速MoE推理提供了新视角。在不同GPU上的实验表明,Qwen2-57B-A14B模型在中等批量大小下最高可获得2.29倍加速,验证了理论预测的正确性。
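To illustrate why a high acceptance rate alone does not guarantee acceleration, here is a minimal sketch of a textbook speculative-decoding cost model. It is not the paper's model and does not define its 'target efficiency' metric. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that costs are expressed relative to one ordinary target-model decoding step. The parameter names `draft_cost_ratio` and `verify_cost_ratio` are hypothetical.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per verification step under i.i.d. acceptance."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def estimated_speedup(alpha, gamma, draft_cost_ratio, verify_cost_ratio=1.0):
    """Tokens gained per step divided by the relative cost of that step."""
    step_cost = gamma * draft_cost_ratio + verify_cost_ratio
    return expected_tokens_per_step(alpha, gamma) / step_cost

# Same acceptance rate, but verifying the drafted tokens costs twice a normal step
# in the second case (for example, because the workload changed).
print(estimated_speedup(0.8, 4, 0.05, verify_cost_ratio=1.0))
print(estimated_speedup(0.8, 4, 0.05, verify_cost_ratio=2.0))
```

In this toy model the acceptance rate is identical in both calls, yet the estimated speedup drops sharply once verification becomes more expensive, which is the kind of effect a metric beyond acceptance rate is meant to capture.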